Please use this identifier to cite or link to this item: https://dspace.ctu.edu.vn/jspui/handle/123456789/124143
Full metadata record
dc.contributor.advisor: Lâm, Nhựt Khang
dc.contributor.author: Phạm, Thị Ngọc Thơ
dc.date.accessioned: 2026-01-10T02:47:17Z
dc.date.available: 2026-01-10T02:47:17Z
dc.date.issued: 2025
dc.identifier.other: B2112011
dc.identifier.uri: https://dspace.ctu.edu.vn/jspui/handle/123456789/124143
dc.description: 53 tr. (53 pages) [vi_VN]
dc.description.abstract: Generating complete cooking recipes from food images is a challenging multimodal problem that requires not only accurate visual understanding but also grounded reasoning to infer ingredients, quantities, and cooking procedures. The task is inherently ambiguous, as a single food image may correspond to multiple valid recipes, and existing vision-language models often suffer from hallucination, producing ingredients or steps that are not visually supported. These limitations significantly reduce the reliability of such systems for real-world culinary applications. This thesis investigates the problem of image-to-recipe generation by systematically comparing two fundamentally different approaches. The first approach employs an end-to-end vision-language model based on LLaVA, fine-tuned on a curated dataset of 9,489 food image-recipe pairs, enabling direct generation of recipe titles, ingredients, and cooking steps from visual input. The second approach proposes a modular retrieval-augmented generation (RAG) pipeline, which integrates a fine-tuned BLIP model for visual feature extraction, a vector-based datastore for retrieving semantically relevant recipe information, and a GPT-based language model for composing the final recipe grounded in retrieved evidence. To evaluate the effectiveness of both approaches, a comprehensive experimental framework is established. Quantitative evaluation includes CLIPScore to measure image-text semantic alignment, along with BLEU, ROUGE, and METEOR scores to assess linguistic quality and content overlap with reference recipes. In addition, qualitative error analysis is conducted to examine hallucination patterns, ingredient omission, and step inconsistency. The results indicate that while the fine-tuned LLaVA model produces more fluent and coherent text, it exhibits a higher tendency to hallucinate ingredients and cooking actions.
In contrast, the BLIP+RAG pipeline demonstrates stronger grounding capabilities, yielding more accurate and visually consistent ingredient lists and procedural steps. Overall, this study highlights the limitations of purely end-to-end multimodal generation for complex reasoning tasks such as recipe synthesis and demonstrates that retrieval-augmented architectures can significantly reduce hallucination and improve factual consistency. The findings suggest that modular, grounded pipelines provide a more reliable and interpretable solution for practical image-to-recipe generation systems, with potential applications in smart cooking assistants, dietary recommendation platforms, and accessible culinary technologies. [vi_VN]
dc.language.iso: en [vi_VN]
dc.publisher: Trường Đại Học Cần Thơ (Can Tho University) [vi_VN]
dc.subject: CÔNG NGHỆ THÔNG TIN - CHẤT LƯỢNG CAO (Information Technology - High-Quality Program) [vi_VN]
dc.title: A SYSTEM FOR GENERATING RECIPES FROM FOOD IMAGES BASED ON VISION-LANGUAGE MODELS [vi_VN]
dc.title.alternative: HỆ THỐNG SINH CÔNG THỨC NẤU ĂN TỪ HÌNH ẢNH MÓN ĂN DỰA TRÊN MÔ HÌNH THỊ GIÁC-NGÔN NGỮ [vi_VN]
dc.type: Thesis [vi_VN]
Appears in Collections:Trường Công nghệ Thông tin & Truyền thông

Files in This Item:
File: _file_ (Restricted Access)
Size: 1.58 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
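The abstract describes a retrieval-augmented pipeline in which BLIP image embeddings are matched against a vector datastore of recipes before a language model composes the final output. The retrieval step can be sketched as follows. This is a minimal illustrative sketch, not the thesis's implementation: the embeddings, recipe titles, and the `retrieve` helper are all fabricated stand-ins, and a real pipeline would use actual BLIP features and a vector database rather than toy 3-dimensional vectors.

```python
# Sketch of cosine-similarity retrieval over a toy recipe datastore.
# The retrieved titles would then be placed into the prompt of a
# GPT-based model to ground its recipe generation.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, datastore, k=2):
    """Return titles of the k datastore entries most similar to the query."""
    ranked = sorted(datastore, key=lambda e: cosine(query_vec, e["vec"]),
                    reverse=True)
    return [e["title"] for e in ranked[:k]]

# Fabricated datastore entries; "vec" stands in for a recipe-text embedding.
datastore = [
    {"title": "Beef pho", "vec": [0.9, 0.1, 0.0]},
    {"title": "Spring rolls", "vec": [0.1, 0.8, 0.3]},
    {"title": "Grilled pork banh mi", "vec": [0.7, 0.2, 0.4]},
]

image_vec = [0.85, 0.15, 0.1]  # stands in for a BLIP image embedding
print(retrieve(image_vec, datastore))  # → ['Beef pho', 'Grilled pork banh mi']
```

The design point this illustrates is the one the abstract argues for: because generation is conditioned on retrieved, visually matched recipes rather than on the image alone, the language model has concrete evidence to copy ingredients and steps from, which is what reduces hallucination relative to the end-to-end LLaVA approach.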