Please use this identifier to cite or link to this item: https://dspace.ctu.edu.vn/jspui/handle/123456789/75449
Title: IMAGE CAPTION GENERATOR WITH A COMBINATION BETWEEN CONVOLUTIONAL NEURAL NETWORK AND LONG SHORT-TERM MEMORY
Authors: Nguyễn, Thanh Hải
Nguyễn, Thị Thúy Duy
Keywords: INFORMATION TECHNOLOGY - HIGH QUALITY PROGRAM (CÔNG NGHỆ THÔNG TIN-CHẤT LƯỢNG CAO)
Issue Date: 2021
Publisher: Trường Đại Học Cần Thơ
Abstract: With the rapid advancement of digitization, there is an enormous volume of imagery, as well as a large amount of related text. Automatic image captioning has attracted considerable scientific attention recently. Its goal is to automatically generate well-formed English sentences describing the content of an image, which has a significant impact in a variety of domains such as virtual assistants, image indexing, editing application recommendation, and assistance for people with disabilities. Although describing an image is a simple task for a human, it is quite complex for a machine. In this project, I describe a joint CNN-LSTM model for automatic image captioning. The model consists of an encoder and a decoder. The encoder is a convolutional neural network that embeds the given image into a fixed-length vector and builds an elaborate representation of it. The decoder uses an LSTM, a recurrent neural network, together with a soft attention mechanism that selectively focuses on certain areas of the image while predicting the caption. I trained the model on the Flickr8k dataset to maximize the likelihood of the target description sentence given the training images, and assessed it with BLEU metrics. I use a merge model to combine the image vector and the partial caption; it is implemented in three major steps: processing the text sequence, extracting the feature vector from the image, and decoding the output by concatenating the two resulting layers. To evaluate model performance, I generate candidate sentences with Beam Search and score them with BLEU. The experiments show that the method can generate captions with relatively accurate content while using less training memory. The Flickr8k dataset consists of 8,000 images, each paired with five different captions that give precise descriptions of the salient entities and events. I use 6,000 images for training, 1,000 for testing, and 1,000 for development; the accompanying Flickr8k text files list the training and test splits. The best results were obtained with BLEU-1, where greedy decoding and Beam Search with k = 5 or k = 7 all achieved scores above 60.
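
To make the merge architecture concrete, below is a minimal sketch of how the three steps described in the abstract could be wired together, assuming TensorFlow/Keras; the vocabulary size, maximum caption length, and 2048-dimensional image feature size are illustrative assumptions, not the thesis's exact configuration.

    # A minimal sketch of the merge architecture, assuming TensorFlow/Keras;
    # vocab_size, max_length and the 2048-dim feature size are illustrative.
    from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, concatenate
    from tensorflow.keras.models import Model

    vocab_size = 8000   # assumed vocabulary size
    max_length = 34     # assumed maximum caption length in tokens

    # Image branch: project the CNN feature vector into a smaller space.
    image_input = Input(shape=(2048,))
    image_dense = Dense(256, activation='relu')(Dropout(0.5)(image_input))

    # Text branch: embed the partial caption and summarize it with an LSTM.
    text_input = Input(shape=(max_length,))
    text_embed = Embedding(vocab_size, 256, mask_zero=True)(text_input)
    text_lstm = LSTM(256)(Dropout(0.5)(text_embed))

    # Decoder: concatenate the two branches and predict the next word.
    merged = concatenate([image_dense, text_lstm])
    hidden = Dense(256, activation='relu')(merged)
    output = Dense(vocab_size, activation='softmax')(hidden)

    model = Model(inputs=[image_input, text_input], outputs=output)
    model.compile(loss='categorical_crossentropy', optimizer='adam')

In this merge design, the image vector is not fed into the LSTM itself; instead, the text summary and the image representation are combined only at the decoding stage, which keeps the recurrent part small and reduces training memory.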
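The Beam Search decoding used for evaluation could look roughly like the following sketch, again under stated assumptions: the helpers word_to_id and id_to_word, the startseq/endseq tokens, and the (1, 2048) feature shape are hypothetical names for illustration rather than the thesis's actual code.

    # A rough sketch of beam-search caption decoding for a trained merge model.
    # word_to_id, id_to_word and the 'startseq'/'endseq' tokens are assumed helpers.
    import numpy as np
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    def beam_search_caption(model, photo_features, word_to_id, id_to_word,
                            max_length, k=5):
        # photo_features: numpy array of shape (1, 2048) from the CNN encoder.
        start = [word_to_id['startseq']]
        # Each beam entry is (token-id sequence, cumulative log-probability).
        beams = [(start, 0.0)]
        for _ in range(max_length):
            candidates = []
            for seq, score in beams:
                if id_to_word[seq[-1]] == 'endseq':
                    candidates.append((seq, score))  # finished caption, keep as-is
                    continue
                padded = pad_sequences([seq], maxlen=max_length)
                probs = model.predict([photo_features, padded], verbose=0)[0]
                # Expand this beam with its k most probable next words.
                for idx in np.argsort(probs)[-k:]:
                    candidates.append((seq + [int(idx)],
                                       score + np.log(probs[idx] + 1e-12)))
            # Retain only the k best partial captions overall.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        best_seq = beams[0][0]
        words = [id_to_word[i] for i in best_seq]
        return ' '.join(w for w in words if w not in ('startseq', 'endseq'))

Greedy decoding corresponds to k = 1; the generated sentences can then be compared against the five reference captions per image with BLEU-1 through BLEU-4.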
Description: 51 pages
URI: https://dspace.ctu.edu.vn/jspui/handle/123456789/75449
Appears in Collections: Trường Công nghệ Thông tin & Truyền thông

Files in This Item:
File: _file_ (Restricted Access), 1.86 MB, Adobe PDF

