End-to-End Emotional Speech Synthesis Method Based on Conditional Variational Autoencoder (基于条件变分自编码器的端到端情感语音合成方法)

Cited by: 4

Authors: ZHANG Jianming (张建明)[1], PENG Jintao (彭锦涛), JIA Hongjie (贾洪杰), MAO Qirong (毛启容)[1,2]

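Affiliations: [1] School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China; [2] Jiangsu Engineering Research Center of Big Data Ubiquitous Perception and Intelligent Agriculture Applications, Zhenjiang, Jiangsu 212013, China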

Source: Journal of Signal Processing (《信号处理》), 2023, No. 4, pp. 678-687 (10 pages)

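Funding: National Natural Science Foundation of China, Key Program (U1836220); National Natural Science Foundation of China, General Program (62176106); National Natural Science Foundation of China, Young Scientists Program (61906077); Jiangsu Provincial Key Research and Development Program (BE2020036); Natural Science Foundation of Jiangsu Province, Young Scientists Program (BK20190838); China Postdoctoral Science Foundation (2020T130257, 2020M671376).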

Abstract: Emotional speech synthesis, an important branch of speech synthesis, has received extensive attention in the field of human-computer interaction. The main open problem is how to obtain better emotion embeddings and inject them effectively into text-to-speech acoustic models. Expressive speech synthesis often obtains style embeddings from reference audio, but these capture only an average representation of style and cannot express a distinct emotional state. This paper proposes CD-Tacotron (Conditional Duration-Tacotron), an end-to-end emotional speech synthesis method based on a conditional variational autoencoder. Building on the Tacotron2 model, it introduces a conditional variational autoencoder to disentangle emotional information from the speech signal and use it as a conditioning factor. Emotion labels are encoded into vectors and concatenated with the remaining style information, which is encoded in a latent space that follows the standard normal distribution. Finally, emotional speech is synthesized through the spectrogram prediction network. Subjective and objective experiments on the ESD dataset show that the proposed method generates more expressive emotional speech than the mainstream methods GST-Tacotron and VAE-Tacotron.

Keywords: emotional speech synthesis; conditional variational autoencoder; end-to-end; Tacotron

Classification number: TN912.33 [Electronics and Telecommunications: Communication and Information Systems]
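
The conditioning scheme described in the abstract (an emotion label encoded as a vector and concatenated with a residual style latent that is regularized toward a standard normal prior) can be illustrated with a minimal sketch. This is not the authors' implementation: the label names, the one-hot encoding, the latent dimension, and every function name below are assumptions made for illustration, and the Tacotron2 encoder, decoder, and reference-audio encoder are omitted entirely.

```python
import math
import random

# Hypothetical emotion label set, chosen for illustration only.
EMOTIONS = ["neutral", "happy", "angry", "sad", "surprise"]


def one_hot(emotion):
    """Encode an emotion label as a one-hot condition vector (assumed encoding)."""
    return [1.0 if e == emotion else 0.0 for e in EMOTIONS]


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions.

    This is the regularizer that pulls the residual style latent
    toward the standard normal prior.
    """
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def style_embedding(emotion, mu, log_var, rng):
    """Training-time embedding: emotion condition + sampled style latent."""
    return one_hot(emotion) + sample_latent(mu, log_var, rng)


def inference_embedding(emotion, latent_dim):
    """Synthesis-time embedding using the prior mean z = 0 (one simple choice)."""
    return one_hot(emotion) + [0.0] * latent_dim


if __name__ == "__main__":
    rng = random.Random(0)
    # mu and log_var would come from a reference-audio encoder; here they are
    # placeholder values for a 4-dimensional latent.
    emb = style_embedding("happy", [0.0] * 4, [0.0] * 4, rng)
    print(len(emb), kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```

In a full system the concatenated vector would be broadcast over the text-encoder outputs before spectrogram prediction; how CD-Tacotron does this is not specified in the abstract, so the sketch stops at the embedding.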
