Please use this identifier to cite or link to this item: https://dspace.ctu.edu.vn/jspui/handle/123456789/124143
Title: A SYSTEM FOR GENERATING RECIPES FROM FOOD IMAGES BASED ON VISION-LANGUAGE MODELS
Other Titles: HỆ THỐNG SINH CÔNG THỨC NẤU ĂN TỪ HÌNH ẢNH MÓN ĂN DỰA TRÊN MÔ HÌNH THỊ GIÁC-NGÔN NGỮ
Authors: Lâm, Nhựt Khang
Phạm, Thị Ngọc Thơ
Keywords: CÔNG NGHỆ THÔNG TIN - CHẤT LƯỢNG CAO
Issue Date: 2025
Publisher: Trường Đại Học Cần Thơ
Abstract: Generating complete cooking recipes from food images is a challenging multimodal problem that requires not only accurate visual understanding but also grounded reasoning to infer ingredients, quantities, and cooking procedures. The task is inherently ambiguous, as a single food image may correspond to multiple valid recipes, and existing vision-language models often suffer from hallucination, producing ingredients or steps that are not visually supported. These limitations significantly reduce the reliability of such systems for real-world culinary applications. This thesis investigates the problem of image-to-recipe generation by systematically comparing two fundamentally different approaches. The first approach employs an end-to-end vision-language model based on LLaVA, fine-tuned on a curated dataset of 9,489 food image-recipe pairs, enabling direct generation of recipe titles, ingredients, and cooking steps from visual input. The second approach proposes a modular retrieval-augmented generation (RAG) pipeline, which integrates a fine-tuned BLIP model for visual feature extraction, a vector-based datastore for retrieving semantically relevant recipe information, and a GPT-based language model for composing the final recipe grounded in retrieved evidence. To evaluate the effectiveness of both approaches, a comprehensive experimental framework is established. Quantitative evaluation includes CLIPScore to measure image-text semantic alignment, along with BLEU, ROUGE, and METEOR scores to assess linguistic quality and content overlap with reference recipes. In addition, qualitative error analysis is conducted to examine hallucination patterns, ingredient omission, and step inconsistency. The results indicate that while the fine-tuned LLaVA model produces more fluent and coherent text, it exhibits a higher tendency to hallucinate ingredients and cooking actions. In contrast, the BLIP+RAG pipeline demonstrates stronger grounding capabilities, yielding more accurate and visually consistent ingredient lists and procedural steps. Overall, this study highlights the limitations of purely end-to-end multimodal generation for complex reasoning tasks such as recipe synthesis and demonstrates that retrieval-augmented architectures can significantly reduce hallucination and improve factual consistency. The findings suggest that modular, grounded pipelines provide a more reliable and interpretable solution for practical image-to-recipe generation systems, with potential applications in smart cooking assistants, dietary recommendation platforms, and accessible culinary technologies.
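The abstract describes the BLIP+RAG pipeline only at a high level. The following is a minimal Python sketch of such a pipeline, not the thesis's actual implementation: the checkpoints (Salesforce/blip-image-captioning-base, all-MiniLM-L6-v2, gpt-4o-mini), the toy in-memory datastore, and the grounding prompt are all illustrative assumptions, since the record does not specify the fine-tuned models, vector store, or prompts used in the thesis.

    # Illustrative image-to-recipe RAG sketch; model names and datastore are
    # assumptions, not the thesis's fine-tuned checkpoints.
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration
    from sentence_transformers import SentenceTransformer, util
    from openai import OpenAI

    # 1) Visual understanding: describe the food image with BLIP.
    blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def describe(image_path: str) -> str:
        image = Image.open(image_path).convert("RGB")
        inputs = blip_proc(image, return_tensors="pt")
        out = blip.generate(**inputs, max_new_tokens=40)
        return blip_proc.decode(out[0], skip_special_tokens=True)

    # 2) Retrieval: embed the description and rank stored recipes by cosine similarity.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    recipes = ["Pho bo: beef noodle soup with rice noodles, broth, herbs ...",
               "Banh mi: baguette with pate, pickled vegetables ..."]  # placeholder datastore
    recipe_vecs = encoder.encode(recipes, convert_to_tensor=True)

    def retrieve(caption: str, k: int = 3) -> list[str]:
        query = encoder.encode(caption, convert_to_tensor=True)
        scores = util.cos_sim(query, recipe_vecs)[0]
        top = scores.topk(min(k, len(recipes))).indices.tolist()
        return [recipes[i] for i in top]

    # 3) Generation: compose a recipe grounded in the retrieved evidence.
    client = OpenAI()  # requires OPENAI_API_KEY in the environment

    def generate_recipe(image_path: str) -> str:
        caption = describe(image_path)
        evidence = "\n".join(retrieve(caption))
        prompt = (f"Image description: {caption}\n"
                  f"Related recipes:\n{evidence}\n"
                  "Write a recipe (title, ingredients, steps) using only "
                  "ingredients supported by the description or the evidence.")
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

The "using only ingredients supported by the evidence" constraint in the prompt reflects the grounding idea the abstract credits with reducing hallucination: the composer is steered toward retrieved recipe text rather than free generation from the image alone.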
Description: 53 pages
URI: https://dspace.ctu.edu.vn/jspui/handle/123456789/124143
Appears in Collections: Trường Công nghệ Thông tin & Truyền thông

Files in This Item:
File: _file_ (Restricted Access)
Size: 1.58 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.