Emotion speech synthesis method fusing multichannel CycleGAN and Mixup

Cited by: 4


Authors: JIA Ning [1]; ZHENG Chunjun [1] (Dalian Neusoft University of Information, Dalian 116023, China)

Affiliation: [1] Dalian Neusoft University of Information, Dalian 116023, Liaoning, China

Source: Modern Electronics Technique (《现代电子技术》), 2022, No. 15, pp. 80-87 (8 pages)

Funding: Inter-university Cooperation Project of the Department of Education of Liaoning Province (86896244); Dalian Science and Technology Plan Project (2019RQ120).

Abstract: The existing cycle-consistent generative adversarial network (CycleGAN) provides a breakthrough for bidirectional emotional corpus conversion, but a large gap remains between the real target speech and the converted speech. To narrow this gap, an emotion speech synthesis method fusing multichannel CycleGAN and Mixup is proposed. The method comprises three stages: a multichannel CycleGAN, Mixup-based loss estimation, and Mixup-based aggravation of effective emotion regions. A gating unit, the gated tanh linear unit (GTLU), and an image representation of audio saliency regions are designed; a global CycleGAN based on the improved GTLU and a local CycleGAN based on the saliency regions are combined to form the multichannel CycleGAN of the first stage. On the basis of the Mixup method, the loss calculation and the different aggravation degrees of the emotion regions are designed. Comparative experiments on generating emotional corpora were carried out on the interactive emotional dyadic motion capture (IEMOCAP) corpus together with several popular speech synthesis methods, with a bidirectional three-layer long short-term memory (LSTM) model used as the verification model. The experimental results show that the mean opinion score (MOS) and the unweighted accuracy (UA) of speech emotion recognition for speech generated by the proposed method are improved by 3.4% and 2.7%, respectively. The proposed method outperforms the existing GAN models in both subjective evaluation and objective experiments, ensuring that the generated speech has high reliability and good naturalness.
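For readers unfamiliar with the two building blocks named in the abstract, the following PyTorch sketch illustrates one plausible reading of them; it is not taken from the paper. The GTLU layer is assumed here to combine a tanh feature branch with a sigmoid gate, in the spirit of the gated units used in CycleGAN-based voice conversion, and the mixup function is the standard Mixup interpolation; the convolution shapes and the mixing parameter alpha are illustrative assumptions rather than the authors' settings.

import numpy as np
import torch
import torch.nn as nn

class GTLU(nn.Module):
    # Hypothetical gated tanh linear unit: a tanh-activated feature branch
    # modulated element-wise by a sigmoid gate. The paper's exact GTLU
    # formulation may differ; kernel size and channel count are assumptions.
    def __init__(self, channels: int):
        super().__init__()
        self.feature = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.feature(x)) * torch.sigmoid(self.gate(x))

def mixup(x1, x2, y1, y2, alpha: float = 0.2):
    # Standard Mixup: convex combination of two samples and their labels
    # with a Beta-distributed mixing coefficient.
    lam = float(np.random.beta(alpha, alpha))
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam

# Minimal usage example on dummy spectrogram-like tensors.
if __name__ == "__main__":
    layer = GTLU(channels=1)
    a, b = torch.randn(2, 1, 80, 64), torch.randn(2, 1, 80, 64)
    ya, yb = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])
    mixed_x, mixed_y, lam = mixup(a, b, ya, yb)
    print(layer(mixed_x).shape, mixed_y, lam)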

Keywords: emotion speech synthesis; multichannel CycleGAN; Mixup; GTLU; image reconstruction; loss estimation; effective emotion region aggravation

CLC Number: TN912.3-34 [Electronics and Telecommunications: Communication and Information Systems]; TP183 [Electronics and Telecommunications: Information and Communication Engineering]

 
