COOKING RECIPE GENERATION FROM FOOD IMAGES USING VISION-LANGUAGE MODEL

Lê, Phương Trung; Ngũ, Công Khanh

Please use this identifier to cite or link to this item: https://dspace.ctu.edu.vn/jspui/handle/123456789/110427

Title:	COOKING RECIPE GENERATION FROM FOOD IMAGES USING VISION-LANGUAGE MODEL
Other title:	NGHIÊN CỨU MÔ HÌNH SINH CÔNG THỨC NẤU ĂN TỪ HÌNH ẢNH SỬ DỤNG MÔ HÌNH NGÔN NGỮ VÀ THỊ GIÁC MÁY TÍNH
Authors:	Lâm, Nhựt Khang; Lê, Phương Trung; Ngũ, Công Khanh
Keywords:	INFORMATION TECHNOLOGY - HIGH QUALITY
Publication year:	2024
Publisher:	Can Tho University (Trường Đại Học Cần Thơ)
Abstract:	Computer vision and natural language processing are increasingly prominent in deep learning and are being applied to practical, real-world problems. In particular, advances in large language models allow them to handle many of the most difficult deep learning tasks, including computer vision tasks. Current image-to-recipe methods are largely retrieval-based, and their success depends heavily on the size and quality of the dataset as well as on the quality of the learned embeddings. Meanwhile, powerful attention-based vision and language models offer a promising avenue for accurate and generalizable recipe generation that has yet to be extensively explored. Our approach leverages the BLIP model to extract and generate titles, and draws on a collection of open-source large language models, including multimodal ones (Llama 3.2, Mistral, and Llama 3.2-Vision), at various scales, optimized for specialized tasks such as visual recognition, image captioning, reasoning, and instruction following. Empirical results demonstrate the effectiveness of this approach and underscore the potential for future developments in this field. Our models achieved notable BLEU and ROUGE scores, with the fine-tuned versions of Mistral 7B and Llama 3.2 8B generating clear and effective cooking instructions. Mistral 7B and Llama 3.2 8B achieved BLEU scores of 0.38 and 0.353, respectively. With prompt engineering techniques, the Llama 3 Vision model achieved the highest BLEU score, 0.5. CIDEr scores further reflect the models’ alignment with human judgment: Llama 3 Vision achieved the highest score, 1.076, underscoring its strong performance in generating semantically accurate and human-like recipe descriptions.
Description:	67 pp.
Identifier:	https://dspace.ctu.edu.vn/jspui/handle/123456789/110427
Collection:	College of Information and Communication Technology (Trường Công nghệ Thông tin & Truyền thông)

Files in this item:

File	Description	Size	Format
_file_ Restricted access		1.67 MB	Adobe PDF

Use of materials in the Digital Library must comply with copyright law.