Emotion transplantation approach for VLSP 2022

Journal of Computer Science and Cybernetics (2022), 369-379. DOI no. 1813-9663 18236

THANG NGUYEN VAN¹, LONG LUONG THANH¹, HUAN VU²
¹ Innovation Center, VNPT-IT, Ha Noi, Viet Nam
² University of Transport and Communications, Ha Noi, Viet Nam

Abstract. Emotional speech synthesis is a challenging task in speech processing. To build an emotional text-to-speech (TTS) system, one needs a high-quality emotional dataset of the target speaker. However, collecting such data is difficult, and sometimes even impossible. This paper presents our approach to the problem of transplanting a source speaker's emotional expression to a target speaker, one of the Vietnamese Language and Speech Processing (VLSP) 2022 TTS tasks. Our approach includes a complete data preprocessing pipeline and two training algorithms. We first train an expressive TTS model on the source speaker, then adapt the voice characteristics to the target speaker. Empirical results show the efficacy of our method in generating expressive speech for a speaker under a limited training data regime.

Keywords. Emotional speech synthesis, Emotion transplantation, Text-to-speech.

1. INTRODUCTION

Traditional TTS systems aim to synthesize human-like speech from text. Speech synthesis is an important feature used widely in applications such as virtual assistants and virtual call centers. Thanks to recent advances in deep learning, models such as Tacotron 2 [14], FastSpeech 2 [13], and VITS [4] have been shown to generate high-quality speech.
To go further, researchers have developed TTS models that add emotional expression to the generated speech [7, 8, 15, 17]. These approaches often rely on an emotional speech dataset from the target speaker, along with emotion embedding techniques that help the model learn the distinct characteristics of each emotion. However, such a dataset is not always available for every speaker, and building .
