Speech Synthesis Adaption Method Based on Phoneme-Level Speaker Embedding Under Small Data

Cited by: 10


Authors: XU Zhi-Hang (徐志航); CHEN Bo (陈博) [1,2]; ZHANG Hui (张辉); YU Kai (俞凯) [1,2]

Affiliations: [1] MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai 200240; [2] Lab of Cross-Media Language Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240; [3] AiSpeech Ltd, Suzhou, Jiangsu 215000

Source: Chinese Journal of Computers (《计算机学报》), 2022, No. 5, pp. 1003-1017 (15 pages)

Abstract: In speech synthesis, speaker adaptation with a small amount of user-recorded data faces a persistent problem: how to raise the similarity of the synthesized voice to the target speaker without excessively reducing its naturalness. Existing utterance-level and frame-level speaker embedding methods yield low similarity when synthesizing the voices of speakers outside the training set. Fine-tuning a pre-trained speech synthesis model on a small amount of user-recorded data improves similarity, but is often accompanied by a decrease in naturalness. To solve this problem, we propose an adaptation method for speech synthesis based on phoneme-level speaker embedding. In the training stage, phoneme-level speaker embeddings are extracted from real feature segments and used to control the training of the speech synthesis model. In the adaptation stage, a speaker embedding prediction network is quickly adapted; in the inference stage, this network replaces real audio as the source of the phoneme-level speaker embeddings that help the model synthesize audio. Experiments use a small amount of real user-recorded data and compare current mainstream speaker embedding methods at different granularities. The results show that, compared with the other speaker embedding methods, the proposed method achieves the best similarity with no significant decrease in naturalness when the speech synthesis model is not updated, and achieves both the best naturalness and the best similarity when the model is updated. Analysis shows that phoneme-level speaker embedding provides a better starting point for model adaptation with almost no increase in adaptation training time, and effectively improves the quality of the speech synthesized by the adapted model.
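To make the phoneme-level granularity concrete, the following is a minimal sketch of how frame-level features might be grouped into one vector per phoneme and then broadcast back to frame resolution. It is an illustration only: the abstract does not specify the embedding extractor, so plain mean pooling is used here as a stand-in for the learned network, and the names `phoneme_level_pool`, `expand_to_frames`, `frames`, and `durations` are hypothetical.

```python
from typing import List

Frame = List[float]


def phoneme_level_pool(frames: List[Frame], durations: List[int]) -> List[Frame]:
    """Return one vector per phoneme by averaging the frames it spans.

    ASSUMPTION: mean pooling stands in for the learned embedding extractor,
    which the abstract does not describe.
    """
    if sum(durations) != len(frames):
        raise ValueError("durations must sum to the number of frames")
    pooled: List[Frame] = []
    start = 0
    for d in durations:
        if d <= 0:
            raise ValueError("each phoneme must span at least one frame")
        segment = frames[start:start + d]
        dim = len(segment[0])
        # Average each feature dimension over the frames of this phoneme.
        pooled.append([sum(f[i] for f in segment) / d for i in range(dim)])
        start += d
    return pooled


def expand_to_frames(phoneme_vectors: List[Frame], durations: List[int]) -> List[Frame]:
    """Repeat each phoneme-level vector once per frame of that phoneme."""
    return [list(vec) for vec, d in zip(phoneme_vectors, durations) for _ in range(d)]


# Toy input: 5 frames of 2-dimensional features, 2 phonemes lasting 2 and 3 frames.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
durations = [2, 3]
embeddings = phoneme_level_pool(frames, durations)
frame_level = expand_to_frames(embeddings, durations)
```

Under this reading, an utterance-level embedding would correspond to a single segment covering all frames, and a frame-level embedding to one segment per frame; the phoneme-level case sits between the two.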

Keywords: speech synthesis; speaker embedding; duration model; small data; speaker adaptation

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
