Please use this identifier to cite or link to this item:
https://dspace.ctu.edu.vn/jspui/handle/123456789/124143

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Lâm, Nhựt Khang | - |
| dc.contributor.author | Phạm, Thị Ngọc Thơ | - |
| dc.date.accessioned | 2026-01-10T02:47:17Z | - |
| dc.date.available | 2026-01-10T02:47:17Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.other | B2112011 | - |
| dc.identifier.uri | https://dspace.ctu.edu.vn/jspui/handle/123456789/124143 | - |
| dc.description | 53 pages | vi_VN |
| dc.description.abstract | Generating complete cooking recipes from food images is a challenging multimodal problem that requires not only accurate visual understanding but also grounded reasoning to infer ingredients, quantities, and cooking procedures. The task is inherently ambiguous, as a single food image may correspond to multiple valid recipes, and existing vision-language models often suffer from hallucination, producing ingredients or steps that are not visually supported. These limitations significantly reduce the reliability of such systems for real-world culinary applications. This thesis investigates the problem of image-to-recipe generation by systematically comparing two fundamentally different approaches. The first approach employs an end-to-end vision-language model based on LLaVA, fine-tuned on a curated dataset of 9,489 food image-recipe pairs, enabling direct generation of recipe titles, ingredients, and cooking steps from visual input. The second approach proposes a modular retrieval-augmented generation (RAG) pipeline, which integrates a fine-tuned BLIP model for visual feature extraction, a vector-based datastore for retrieving semantically relevant recipe information, and a GPT-based language model for composing the final recipe grounded in retrieved evidence. To evaluate the effectiveness of both approaches, a comprehensive experimental framework is established. Quantitative evaluation includes CLIPScore to measure image-text semantic alignment, along with BLEU, ROUGE, and METEOR scores to assess linguistic quality and content overlap with reference recipes. In addition, qualitative error analysis is conducted to examine hallucination patterns, ingredient omission, and step inconsistency. The results indicate that while the fine-tuned LLaVA model produces more fluent and coherent text, it exhibits a higher tendency to hallucinate ingredients and cooking actions. In contrast, the BLIP+RAG pipeline demonstrates stronger grounding capabilities, yielding more accurate and visually consistent ingredient lists and procedural steps. Overall, this study highlights the limitations of purely end-to-end multimodal generation for complex reasoning tasks such as recipe synthesis and demonstrates that retrieval-augmented architectures can significantly reduce hallucination and improve factual consistency. The findings suggest that modular, grounded pipelines provide a more reliable and interpretable solution for practical image-to-recipe generation systems, with potential applications in smart cooking assistants, dietary recommendation platforms, and accessible culinary technologies. | vi_VN |
| dc.language.iso | en | vi_VN |
| dc.publisher | Trường Đại Học Cần Thơ | vi_VN |
| dc.subject | CÔNG NGHỆ THÔNG TIN - CHẤT LƯỢNG CAO (Information Technology - High-Quality Program) | vi_VN |
| dc.title | A SYSTEM FOR GENERATING RECIPES FROM FOOD IMAGES BASED ON VISION-LANGUAGE MODELS | vi_VN |
| dc.title.alternative | HỆ THỐNG SINH CÔNG THỨC NẤU ĂN TỪ HÌNH ẢNH MÓN ĂN DỰA TRÊN MÔ HÌNH THỊ GIÁC-NGÔN NGỮ | vi_VN |
| dc.type | Thesis | vi_VN |
| Collection: | Trường Công nghệ Thông tin & Truyền thông | |
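The abstract above evaluates generated recipes against references with n-gram overlap metrics (BLEU, ROUGE, METEOR). As a rough illustration of what such metrics measure, here is a minimal pure-Python sketch of clipped unigram precision (BLEU-style) and unigram recall (ROUGE-style); the function names and example sentences are illustrative and not taken from the thesis, and the simplified formulas omit the official metrics' brevity penalty, multi-n-gram averaging, and F-score details.

```python
# Illustrative n-gram overlap scoring in the spirit of BLEU/ROUGE.
# Simplified sketch, not the official metric implementations.
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_precision(candidate, reference, n=1):
    """Clipped n-gram precision (BLEU-style): fraction of candidate
    n-grams that also appear in the reference (counts clipped)."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / max(sum(cand.values()), 1)


def overlap_recall(candidate, reference, n=1):
    """n-gram recall (ROUGE-style): fraction of reference
    n-grams that are covered by the candidate."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / max(sum(ref.values()), 1)


generated = "slice the tomatoes and add salt".split()
reference = "slice the tomatoes then add salt and pepper".split()
print(round(overlap_precision(generated, reference), 3))  # 1.0
print(round(overlap_recall(generated, reference), 3))     # 0.75
```

High precision with lower recall, as in this toy example, mirrors a fluent generation that omits reference content; the thesis's qualitative analysis of ingredient omission targets exactly this failure mode.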
Files in this item:
| File | Description | Size | Format |
|---|---|---|---|
| Restricted access | | 1.58 MB | Adobe PDF |
Use of documents in this Digital Library must comply with the Copyright Law.