Authors: WANG Guang [1]; LIU Zong-Ze; JIANG Yan-Ji [1,3]; DONG Hao (Software College, Liaoning Technical University, Huludao 125105, China; Suzhou Automotive Research Institute, Tsinghua University, Suzhou 215134, China; OpenSafe Laboratory, Youce (Jiangsu) Safety Technology Co., Ltd., Suzhou 215100, China)
Affiliations: [1] Software College, Liaoning Technical University, Huludao 125105, China; [2] Suzhou Automotive Research Institute, Tsinghua University, Suzhou 215134, China; [3] OpenSafe Laboratory, Youce (Jiangsu) Safety Technology Co., Ltd., Suzhou 215100, China
Source: Computer Systems & Applications (《计算机系统应用》), 2024, No. 9, pp. 216-225 (10 pages)
Funding: General Program of the Department of Education of Liaoning Province (LJKZ0338); Huludao Science and Technology Plan (2023JH(1)4/02b); Guangdong Province Special Fund for Science and Technology Innovation Strategy, City and County Science and Technology Innovation Support Project (STKJ2023071).
Abstract: As voice conversion becomes increasingly prevalent in human-computer interaction, the need for highly expressive speech continues to grow. Current voice conversion relies mainly on decoupling acoustic features, emphasizing the separation of content and timbre, but it rarely considers the emotional characteristics mixed into speech, so the converted audio lacks emotional expressiveness. To address this problem, this study proposes a highly expressive voice conversion model with multiple mutual information constraints (MMIC-EVC). On top of decoupling content and timbre features, the model introduces an expressiveness module that models discourse-level prosody and rhythm features so that emotional characteristics can be transferred; it then constrains each encoder to focus on its own acoustic embedding by minimizing variational log-ratio upper bounds on the multiple mutual information between the features. Experiments on the CSTR-VCTK and ESD speech datasets show that the converted audio of the proposed model achieves a mean opinion score (MOS) of 3.78 for naturalness and a mel-cepstral distortion of 5.39 dB, and that it outperforms the baseline models by a large margin in the best-worst scaling test. MMIC-EVC effectively decouples prosodic and rhythmic features and achieves highly expressive voice conversion, providing a better and more natural user experience in human-computer interaction.
Keywords: voice conversion; feature decoupling; mutual information constraint; prosody modeling; human-computer interaction
CLC number: TN912.3 [Electronics and Telecommunications / Communication and Information Systems]; TP18 [Electronics and Telecommunications / Information and Communication Engineering]
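Note: the multiple mutual information constraint is described in the abstract only as minimizing a variational log-ratio upper bound between pairs of acoustic embeddings. One common way to realize such a bound is a CLUB-style estimator; the PyTorch sketch below illustrates that idea under this assumption. All names (ClubUpperBound, content, timbre) are illustrative, and the snippet is a minimal sketch rather than the authors' implementation.

import torch
import torch.nn as nn

class ClubUpperBound(nn.Module):
    # Variational network q(y|x), a diagonal Gaussian over embedding y given embedding x.
    def __init__(self, x_dim, y_dim, hidden=256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim), nn.Tanh())

    def log_likelihood(self, x, y):
        # Maximized with respect to this module only, so that q(y|x) tracks the true conditional.
        mu, logvar = self.mu(x), self.logvar(x)
        return (-(y - mu) ** 2 / logvar.exp() - logvar).sum(dim=1).mean()

    def forward(self, x, y):
        # CLUB upper bound on I(x; y): E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)];
        # negative pairs are formed by shuffling y within the batch.
        mu, logvar = self.mu(x), self.logvar(x)
        pos = -(y - mu) ** 2 / logvar.exp()
        neg = -(y[torch.randperm(y.size(0))] - mu) ** 2 / logvar.exp()
        return (pos - neg).sum(dim=1).mean() / 2.0

# Usage sketch: one estimator per feature pair (e.g. content vs. timbre, content vs. prosody).
# Its output is added to the conversion loss so each encoder is pushed toward a disentangled
# embedding, while the estimator itself is trained on detached embeddings.
club_ct = ClubUpperBound(x_dim=256, y_dim=256)
content, timbre = torch.randn(8, 256), torch.randn(8, 256)  # placeholder embeddings
estimator_loss = -club_ct.log_likelihood(content.detach(), timbre.detach())
mi_penalty = club_ct(content, timbre)  # added to the main training objective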