IMPROVING IMAGE CAPTION USING CLIP

Dương, Thị Yến Nhi

Please use this identifier to cite or link to this item: https://dspace.ctu.edu.vn/jspui/handle/123456789/110697

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Trần, Công Án	-
dc.contributor.author	Dương, Thị Yến Nhi	-
dc.date.accessioned	2025-02-03T07:03:09Z	-
dc.date.available	2025-02-03T07:03:09Z	-
dc.date.issued	2024	-
dc.identifier.other	B2017065	-
dc.identifier.uri	https://dspace.ctu.edu.vn/jspui/handle/123456789/110697	-
dc.description	37 pp.	vi_VN
dc.description.abstract	Image captioning, the task of generating textual descriptions for visual content, has advanced significantly with the integration of pre-trained vision-language models. This work explores the application of CLIP’s robust cross-modal embeddings in a CLIP-based captioning framework. The proposed method uses CLIP as a foundation model and fine-tunes a lightweight transformer-based decoder on top of CLIP embeddings. By keeping the pre-trained CLIP weights fixed and adjusting only the "Prefix" and "Decoder" modules, the framework generates contextually rich captions efficiently. The model is evaluated on the Conceptual Captions and COCO Captions datasets. CLIP-based embeddings address a limitation of traditional image captioning models, namely the need for extensive task-specific training. By exploiting pre-trained representations, this approach reduces computational requirements while improving descriptive accuracy and semantic relevance. The method achieves competitive results on standard metrics such as CIDEr, BLEU, and SPICE, demonstrating substantial improvements in caption quality and relevance. Specifically, on Conceptual Captions, CLIP + GPT2 achieves ROUGE-L 26.71, CIDEr 87.26, and SPICE 18.5 with a training time of 65 hours. On COCO Captions, the CLIP + GPT2; transformer configuration achieves B@4 33.53, METEOR 28.43, CIDEr 113.08, and SPICE 21.05 with a training time of 6 hours. This research highlights the potential of CLIP-based architectures for building efficient and high-performing image captioning systems.	vi_VN
dc.language.iso	vi	vi_VN
dc.publisher	Trường Đại Học Cần Thơ	vi_VN
dc.subject	INFORMATION TECHNOLOGY - HIGH QUALITY	vi_VN
dc.title	IMPROVING IMAGE CAPTION USING CLIP	vi_VN
dc.title.alternative	CẢI THIỆN CHÚ THÍCH HÌNH ẢNH SỬ DỤNG CLIP	vi_VN
dc.type	Thesis	vi_VN
Appears in Collections:	Trường Công nghệ Thông tin & Truyền thông

Files in This Item:

File	Description	Size	Format
_file_ Restricted Access		1.08 MB	Adobe PDF
