ENHANCING VIETNAMESE SPEECH SYNTHESIS IN AUDIOBOOK THROUGH XTTS-BASED TEXT TO SPEECH AND VOICE CLONING

Nguyễn, Phương Thụy

Please use this identifier to cite or link to this item: https://dspace.ctu.edu.vn/jspui/handle/123456789/126321

Title:	ENHANCING VIETNAMESE SPEECH SYNTHESIS IN AUDIOBOOK THROUGH XTTS-BASED TEXT TO SPEECH AND VOICE CLONING
Other title:	PHÁT TRIỂN TỔNG HỢP GIỌNG ĐỌC TIẾNG VIỆT CHO SÁCH NÓI DỰA TRÊN MÔ HÌNH XTTS VÀ NHÂN BẢN GIỌNG
Authors:	Lâm, Nhựt Khang; Nguyễn, Phương Thụy
Keywords:	INFORMATION TECHNOLOGY - HIGH-QUALITY PROGRAM
Year of publication:	2025
Publisher:	Can Tho University (Trường Đại Học Cần Thơ)
Abstract:	This thesis investigates the enhancement of Vietnamese audiobook speech synthesis using the XTTS architecture, a multilingual zero-shot Text-to-Speech model capable of cloning unseen speakers from short reference audio. Although XTTS provides strong baseline performance, its default configuration is not fully optimized for the tonal and prosodic characteristics of Vietnamese, particularly in long-form narration. To address this limitation, the proposed approach introduces a Vietnamese-specific tokenizer, a refined text and audio preprocessing pipeline, and a selective fine-tuning strategy that updates only the GPT-based acoustic module while keeping the remaining pretrained components unchanged. Experiments are conducted on the PhoAudiobook dataset, a large-scale Vietnamese audiobook corpus, using 56,879 samples from 276 speakers for training, 998 samples from 34 speakers for validation, and 6,458 samples from 35 speakers for testing. Model performance is assessed using objective metrics: Character Error Rate (CER), Speaker Encoder Cosine Similarity (SECS), and UTMOS for perceptual naturalness. The fine-tuned XTTSv2 model achieves a CER of 0.0709, indicating stable textual accuracy; a SECS of 0.9126, reflecting very high speaker similarity in zero-shot conditions; and a UTMOS score of 2.2141, demonstrating acceptable perceptual naturalness for audiobook-style speech. These results confirm that targeted adaptation of XTTS effectively improves Vietnamese speech synthesis while maintaining strong speaker identity preservation, providing a solid foundation for future research on Vietnamese dialects and expressive narration.
Description:	44 pp.
Identifier:	https://dspace.ctu.edu.vn/jspui/handle/123456789/126321
Collection:	College of Information and Communication Technology (Trường Công nghệ Thông tin & Truyền thông)
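
The abstract reports CER and SECS without stating how they are computed. As a rough illustration only, the sketch below assumes the common definitions: CER as the character-level edit distance between the input text and a transcript of the synthesized audio, divided by the length of the input text, and SECS as the cosine similarity between two speaker-embedding vectors. The function names are hypothetical, the speech recognizer and speaker encoder used in the thesis are not named in this record and are not modeled here, and the NFC normalization step is an added assumption so that composed and decomposed Vietnamese diacritics compare as equal.

```python
import math
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    # Assumption: normalize to NFC so that "á" typed as one code point and
    # "a" + combining acute accent are treated as the same character.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # One dropped tone mark in an 8-character reference.
    print(cer("sách nói", "sách noi"))
    # Toy 2-dimensional "embeddings"; real speaker embeddings are much larger.
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Under these definitions, a corpus-level CER such as the reported 0.0709 would correspond to roughly seven character errors per hundred reference characters, and a SECS near 1 means the embeddings of the reference and synthesized voices point in almost the same direction. Whether the thesis averages per utterance or pools over the whole test set is not stated in this record.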

Files in this item:

File	Description	Size	Format
_file_ Restricted access		1.81 MB	Adobe PDF