A Lightweight End-to-End Speech Synthesis System with Pitch Prediction


Authors: Liang Ting, Askar Hamdulla [1], Liu Huang, Xu Ying

Affiliations: [1] School of Information Science and Engineering, Xinjiang University, Urumqi 830046, Xinjiang, China; [2] Shanghai GERZZ Interactive Information Technology Co., Ltd., Shanghai 200000, China

Source: Journal of Nanjing Normal University (Engineering and Technology Edition), 2023, No. 4, pp. 37-42 (6 pages)

Abstract: This paper proposes a lightweight, fully end-to-end speech synthesis model with controllable pitch (fundamental frequency, F0). The model builds on VITS, currently the most popular fully end-to-end speech synthesis model, which combines a VAE-based posterior encoder, a normalizing-flow-based prior encoder, and an adversarial decoder. Three improvements are made so that the synthesized speech has stronger prosody, which raises its naturalness and expressiveness, while pronunciation accuracy and inference speed also improve. First, a length regulator and a frame prior network are introduced to obtain fine-grained, frame-level mean and variance representations of the acoustic features, modeling the rich acoustic variation in speech; a phoneme predictor and CTC loss are added to improve pronunciation stability. Second, the ground-truth phoneme durations are used to align the text with the audio frames, and an F0 predictor is added to strengthen the prosody of the speech. Third, the decoder of the original VITS model is replaced with multi-band generation and the inverse short-time Fourier transform, which effectively improves inference speed. MOS (mean opinion score) tests and RTF (real-time factor) serve as the subjective and objective evaluation criteria, respectively. Experiments show that the proposed model improves naturalness and expressiveness by at least 5% in MOS and improves inference speed by 3 times in RTF compared with the original VITS.
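The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it mentions: expanding phoneme-level representations to frame level using known durations (the role usually played by a length regulator), and computing the real-time factor (RTF). The function names `length_regulate` and `real_time_factor`, the toy vectors, and the durations are all illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence


def length_regulate(phoneme_states: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Repeat each phoneme-level vector once per frame it spans.

    Hypothetical helper: the paper's own length regulator is not shown
    in the abstract, this only illustrates duration-based expansion.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames: List[List[float]] = []
    for state, n_frames in zip(phoneme_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so frames do not share the same list object.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# Toy input (assumed values): 3 phonemes with 2-dimensional hidden states.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # frames per phoneme, e.g. from a forced alignment
frames = length_regulate(states, durations)
print(len(frames))                 # 6 frames = 2 + 1 + 3
print(real_time_factor(0.5, 2.0))  # 0.25
```

An RTF below 1 means the system synthesizes audio faster than real time, so a lower RTF corresponds to faster inference.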

Keywords: end-to-end speech synthesis; prosody prediction; inverse fast Fourier transform; variational autoencoder; multi-band

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
