Please use this identifier to cite or link to this item:
https://dspace.ctu.edu.vn/jspui/handle/123456789/124080

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Lâm, Nhựt Khang | - |
| dc.contributor.author | Lâm, Yến Thu | - |
| dc.date.accessioned | 2026-01-09T01:15:44Z | - |
| dc.date.available | 2026-01-09T01:15:44Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.other | B2111956 | - |
| dc.identifier.uri | https://dspace.ctu.edu.vn/jspui/handle/123456789/124080 | - |
| dc.description | 51 pages | vi_VN |
| dc.description.abstract | This thesis presents the design and development of a web-based system for automatic image captioning with text-to-speech using artificial intelligence. The proposed system aims to support users in understanding visual content by generating descriptive text and spoken narratives from input images. We design an end-to-end pipeline that integrates image captioning, text generation, and speech synthesis into a single web application. First, an image uploaded by the user is processed by the BLIP model to generate a concise caption that reflects the main visual content. Based on this caption, the T5-Large language model is used to generate a short narrative consisting of five coherent sentences, expanding the original description into a more expressive and contextual story. The generated text is then converted into natural-sounding speech using the Google Text-to-Speech library, allowing users to both read and listen to the output. We implement the system as a website to ensure accessibility and ease of use across different devices without requiring specialized hardware. To evaluate the quality of the generated captions and stories, we conduct experiments using standard natural language generation metrics, including BLEU, ROUGE, and CIDEr. In addition, qualitative analysis is performed to assess the semantic relevance, fluency, and coherence of the generated text. The experimental results indicate that the proposed system is capable of producing accurate image descriptions, coherent short stories, and clear speech output. This study demonstrates the effectiveness of combining vision-language models, text generation models, and text-to-speech technology in a unified multimodal system, highlighting its potential applications in assistive technology, digital content creation, and human-computer interaction. | vi_VN |
| dc.language.iso | en | vi_VN |
| dc.publisher | Trường Đại Học Cần Thơ | vi_VN |
| dc.subject | CÔNG NGHỆ THÔNG TIN - CHẤT LƯỢNG CAO | vi_VN |
| dc.title | BUILDING A WEBSITE FOR AUTOMATIC IMAGE CAPTIONING WITH TEXT-TO-SPEECH USING AI | vi_VN |
| dc.title.alternative | XÂY DỰNG WEBSITE TẠO VÀ ĐỌC CHÚ THÍCH ẢNH TỰ ĐỘNG BẰNG AI | vi_VN |
| dc.type | Thesis | vi_VN |
| Collection: | Trường Công nghệ Thông tin & Truyền thông | |
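The abstract above evaluates generated captions with n-gram overlap metrics such as BLEU. As an illustration only (not the thesis's actual evaluation code), here is a minimal pure-Python sketch of sentence-level BLEU against a single reference, using modified (clipped) n-gram precision, a geometric mean over n = 1..4, and the standard brevity penalty:

```python
import math
from collections import Counter


def sentence_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """BLEU score of one candidate caption against one reference caption.

    Illustrative sketch: modified n-gram precision (counts clipped by the
    reference), geometric mean over n = 1..max_n, and a brevity penalty.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(
            tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1)
        )
        ref_ngrams = Counter(
            tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
        )
        total = sum(cand_ngrams.values())
        if total == 0:
            return 0.0  # candidate shorter than n tokens
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if clipped == 0:
            return 0.0  # a zero precision collapses the geometric mean
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand_tokens) > len(ref_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0, while partial n-gram overlap yields a score strictly between 0 and 1. Production evaluations typically use an established implementation (e.g. in NLTK or sacreBLEU) with multiple references and smoothing.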
Files in this item:
| File | Description | Size | Format |
|---|---|---|---|
| _file_ | Restricted access | 1.5 MB | Adobe PDF |