Please use this identifier to cite or link to this item: https://dspace.ctu.edu.vn/jspui/handle/123456789/75449
Title: IMAGE CAPTION GENERATOR WITH A COMBINATION BETWEEN CONVOLUTIONAL NEURAL NETWORK AND LONG SHORT-TERM MEMORY
Authors: Nguyễn, Thanh Hải
Nguyễn, Thị Thúy Duy
Keywords: INFORMATION TECHNOLOGY - HIGH QUALITY
Issue Date: 2021
Publisher: Can Tho University
Abstract: With the rapid advancement of digitization, there is an enormous volume of imagery, as well as a large amount of related text. Automatic image captioning has attracted considerable scientific attention recently. Its goal is to automatically generate well-formed English sentences that describe the content of an image, which has significant impact in a variety of domains such as virtual assistants, image indexing, editing application recommendation, and assistance for people with disabilities. Although describing an image is a simple task for a human, it is quite complex for a machine. In this project, I describe a joint CNN-LSTM model for automatic image captioning. The model consists of an encoder and a decoder. The encoder is a convolutional neural network that embeds the given image into a fixed-length vector, building an elaborate representation of it. The decoder is an LSTM, a recurrent neural network, combined with a soft attention mechanism that selectively focuses on certain regions of the image to predict the next word of the caption. I trained the model on the Flickr8k dataset to maximize the likelihood of the target description sentence given the training images, and assessed it with BLEU metrics. A merge model combines the image vector with the partial caption; it is implemented in three major steps: processing the text sequence, extracting the feature vector from the image, and decoding the output by concatenating the two branches. To evaluate model performance, I generate multiple candidate sentences with Beam Search and score them with BLEU. The experiments show that the method can generate captions with relatively accurate content while using less training memory. The Flickr8k dataset consists of 8,000 images, each paired with five different captions that provide precise descriptions of the salient entities and events; I use 6,000 images for training, 1,000 for testing, and 1,000 for development, and the accompanying Flickr8k text files define the train and test splits. The best results were obtained with BLEU-1: greedy search and beam search with k = 5 or k = 7 all scored above 60.
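As a concrete illustration of the encoder step, the sketch below extracts a fixed-length feature vector from an image with a pre-trained CNN in Keras. The abstract does not name the backbone or the feature size, so VGG16 and its 4096-dimensional fc2 layer are assumptions for illustration only.

```python
# Hedged sketch of the CNN encoder: a pre-trained backbone maps each image
# to a fixed-length feature vector. VGG16 and the 4096-d "fc2" layer are
# assumptions; the abstract only states that a CNN embeds the image.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def build_encoder():
    base = VGG16(weights="imagenet")
    # Drop the final softmax; keep the fc2 activations as the image embedding.
    return Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_features(encoder, image_path):
    img = load_img(image_path, target_size=(224, 224))
    x = preprocess_input(img_to_array(img)[np.newaxis, ...])  # (1, 224, 224, 3)
    return encoder.predict(x, verbose=0)                      # (1, 4096)
```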
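The three-step merge model described above (process the text sequence, extract the image feature vector, concatenate and decode) could look roughly like the following Keras sketch. Layer sizes (256 units) are illustrative assumptions, and the soft attention mechanism mentioned in the abstract is not shown here.

```python
# Minimal sketch of the merge architecture: one branch encodes the partial
# caption (Embedding + LSTM), the other projects the image feature vector,
# and the two are concatenated before predicting the next word.
from tensorflow.keras.layers import (Input, Dense, Embedding, LSTM,
                                     Dropout, Concatenate)
from tensorflow.keras.models import Model

def build_merge_model(vocab_size, max_length, feature_dim=4096):
    # Image branch: compress the CNN feature vector.
    img_in = Input(shape=(feature_dim,))
    img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

    # Text branch: embed and encode the partial caption.
    txt_in = Input(shape=(max_length,))
    txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
    txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

    # Merge the two branches and decode a distribution over the vocabulary.
    merged = Concatenate()([img_vec, txt_vec])
    hidden = Dense(256, activation="relu")(merged)
    out = Dense(vocab_size, activation="softmax")(hidden)

    model = Model(inputs=[img_in, txt_in], outputs=out)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```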
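For caption generation, a minimal beam-search decoder over the merge model might look like this; k = 5 and k = 7 are the beam widths the abstract reports. The `startseq`/`endseq` boundary tokens and the Keras `Tokenizer` interface are assumptions about the preprocessing.

```python
# Hedged sketch of beam-search decoding over the merge model above.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_search_caption(model, tokenizer, photo, max_length, k=5):
    # Each beam is a (token id sequence, cumulative log-probability) pair.
    beams = [([tokenizer.word_index["startseq"]], 0.0)]
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            if tokenizer.index_word.get(seq[-1]) == "endseq":
                candidates.append((seq, score))   # finished caption: keep
                continue
            padded = pad_sequences([seq], maxlen=max_length)
            probs = model.predict([photo, padded], verbose=0)[0]
            for w in np.argsort(probs)[-k:]:      # k most probable next words
                candidates.append((seq + [int(w)],
                                   score + np.log(probs[w] + 1e-12)))
        # Keep only the k best partial captions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    words = [tokenizer.index_word.get(i, "") for i in beams[0][0]]
    return " ".join(w for w in words if w not in ("", "startseq", "endseq"))
```

Greedy search, also evaluated in the abstract, is the k = 1 special case: keep only the single most probable word at each step.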
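Finally, a sketch of BLEU-1 scoring with NLTK, assuming each test image carries its five Flickr8k reference captions; the helper names (`test_captions`, `test_features`) are hypothetical.

```python
# Hedged sketch of BLEU evaluation: generated captions are compared against
# the five human references per Flickr8k image.
from nltk.translate.bleu_score import corpus_bleu

def evaluate_bleu1(model, tokenizer, test_captions, test_features, max_length):
    references, hypotheses = [], []
    for image_id, caption_list in test_captions.items():
        generated = beam_search_caption(model, tokenizer,
                                        test_features[image_id], max_length)
        references.append([c.split() for c in caption_list])  # 5 refs per image
        hypotheses.append(generated.split())
    # BLEU-1 puts all weight on unigram precision.
    return corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
```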
Description: 51 pages
URI: https://dspace.ctu.edu.vn/jspui/handle/123456789/75449
Appears in Collections: College of Information & Communication Technology

Files in this item:
File | Description | Size | Format
_file_ | Restricted access | 1.86 MB | Adobe PDF