Please use this identifier to cite or link to this item: https://dspace.ctu.edu.vn/jspui/handle/123456789/124143
Full metadata record
dc.contributor.advisor: Lâm, Nhựt Khang
dc.contributor.author: Phạm, Thị Ngọc Thơ
dc.date.accessioned: 2026-01-10T02:47:17Z
dc.date.available: 2026-01-10T02:47:17Z
dc.date.issued: 2025
dc.identifier.other: B2112011
dc.identifier.uri: https://dspace.ctu.edu.vn/jspui/handle/123456789/124143
dc.description: 53 tr. (53 pages) [vi_VN]
dc.description.abstract: Generating complete cooking recipes from food images is a challenging multimodal problem that requires not only accurate visual understanding but also grounded reasoning to infer ingredients, quantities, and cooking procedures. The task is inherently ambiguous, as a single food image may correspond to multiple valid recipes, and existing vision-language models often suffer from hallucination, producing ingredients or steps that are not visually supported. These limitations significantly reduce the reliability of such systems for real-world culinary applications. This thesis investigates the problem of image-to-recipe generation by systematically comparing two fundamentally different approaches. The first approach employs an end-to-end vision-language model based on LLaVA, fine-tuned on a curated dataset of 9,489 food image-recipe pairs, enabling direct generation of recipe titles, ingredients, and cooking steps from visual input. The second approach proposes a modular retrieval-augmented generation (RAG) pipeline, which integrates a fine-tuned BLIP model for visual feature extraction, a vector-based datastore for retrieving semantically relevant recipe information, and a GPT-based language model for composing the final recipe grounded in retrieved evidence. To evaluate the effectiveness of both approaches, a comprehensive experimental framework is established. Quantitative evaluation includes CLIPScore to measure image-text semantic alignment, along with BLEU, ROUGE, and METEOR scores to assess linguistic quality and content overlap with reference recipes. In addition, qualitative error analysis is conducted to examine hallucination patterns, ingredient omission, and step inconsistency. The results indicate that while the fine-tuned LLaVA model produces more fluent and coherent text, it exhibits a higher tendency to hallucinate ingredients and cooking actions.
In contrast, the BLIP+RAG pipeline demonstrates stronger grounding capabilities, yielding more accurate and visually consistent ingredient lists and procedural steps. Overall, this study highlights the limitations of purely end-to-end multimodal generation for complex reasoning tasks such as recipe synthesis and demonstrates that retrieval-augmented architectures can significantly reduce hallucination and improve factual consistency. The findings suggest that modular, grounded pipelines provide a more reliable and interpretable solution for practical image-to-recipe generation systems, with potential applications in smart cooking assistants, dietary recommendation platforms, and accessible culinary technologies. [vi_VN]
dc.language.iso: en [vi_VN]
dc.publisher: Trường Đại Học Cần Thơ (Can Tho University) [vi_VN]
dc.subject: CÔNG NGHỆ THÔNG TIN - CHẤT LƯỢNG CAO (Information Technology - High-Quality Program) [vi_VN]
dc.title: A SYSTEM FOR GENERATING RECIPES FROM FOOD IMAGES BASED ON VISION-LANGUAGE MODELS [vi_VN]
dc.title.alternative: HỆ THỐNG SINH CÔNG THỨC NẤU ĂN TỪ HÌNH ẢNH MÓN ĂN DỰA TRÊN MÔ HÌNH THỊ GIÁC-NGÔN NGỮ [vi_VN]
dc.type: Thesis [vi_VN]
Appears in Collections:Trường Công nghệ Thông tin & Truyền thông

Files in This Item:
File: _file_ (Restricted Access)
Size: 1.58 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
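The abstract describes a retrieval-augmented pipeline in which BLIP image embeddings are matched against a vector datastore of recipes before a language model composes the final output. The retrieval step can be sketched as follows. This is a minimal illustrative sketch, not the thesis's implementation: the embeddings, recipe titles, and the `retrieve` helper are all fabricated stand-ins, and a real pipeline would use actual BLIP features and a vector database rather than toy 3-dimensional vectors.

```python
# Sketch of cosine-similarity retrieval over a toy recipe datastore.
# The retrieved titles would then be placed into the prompt of a
# GPT-based model to ground its recipe generation.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, datastore, k=2):
    """Return titles of the k datastore entries most similar to the query."""
    ranked = sorted(datastore, key=lambda e: cosine(query_vec, e["vec"]),
                    reverse=True)
    return [e["title"] for e in ranked[:k]]

# Fabricated datastore entries; "vec" stands in for a recipe-text embedding.
datastore = [
    {"title": "Beef pho", "vec": [0.9, 0.1, 0.0]},
    {"title": "Spring rolls", "vec": [0.1, 0.8, 0.3]},
    {"title": "Grilled pork banh mi", "vec": [0.7, 0.2, 0.4]},
]

image_vec = [0.85, 0.15, 0.1]  # stands in for a BLIP image embedding
print(retrieve(image_vec, datastore))  # → ['Beef pho', 'Grilled pork banh mi']
```

The design point this illustrates is the one the abstract argues for: because generation is conditioned on retrieved, visually matched recipes rather than on the image alone, the language model has concrete evidence to copy ingredients and steps from, which is what reduces hallucination relative to the end-to-end LLaVA approach.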