One-shot voice conversion integrating information perturbation and feature decoupling


Authors: Wang Guang [1]; Liu Zongze; Dong Hao [2]; Jiang Yanji [1]

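Affiliations: [1] College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China; [2] Suzhou Automotive Research Institute, Tsinghua University, Suzhou, Jiangsu 215134, China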

Source: Application Research of Computers (《计算机应用研究》), 2024, No. 10, pp. 3081-3086 (6 pages)

Funding: Huludao Science and Technology Plan Project (2023JH(1)4/02b).

Abstract: One-shot voice conversion converts speaker identity using only a single speech sample from the target speaker. However, because acoustic features interact in complex ways and vary dynamically, existing methods struggle to fully disentangle the speaker's timbre from the other acoustic features in a single sample. The converted audio therefore still sounds similar to the source speaker; that is, the source speaker's timbre leaks into the output. To address this, the paper proposes IPFD-VC, a one-shot voice conversion model that integrates information perturbation and feature decoupling. First, an information perturbation module applies three perturbation operations to the speech signal to remove redundant information from the inputs to the content and prosody encoders. Second, the processed signal is fed into each encoder, and a mutual information minimization strategy further decouples the acoustic features, reducing the correlation between the other features and the speaker's timbre features. Finally, a decoder and a vocoder output the converted audio. Experiments show that audio converted by IPFD-VC scores 3.72 for speech naturalness and 3.68 for speaker similarity, and that its Mel-cepstral distortion is 0.26 dB lower than that of the advanced UUVC model. The model effectively decouples acoustic features and captures the target speaker's timbre while preserving the source linguistic content and prosodic variation, reducing the risk of speaker timbre leakage.
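The abstract reports a Mel-cepstral distortion (MCD) gain in dB but does not say how the metric is computed. As a rough illustration only, the sketch below uses the commonly cited per-frame definition, MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames. It assumes the two sequences are already time-aligned (for example by dynamic time warping) and that the 0th (energy) coefficient is excluded; the function name `mel_cepstral_distortion` is hypothetical and is not taken from the paper.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB between two time-aligned sequences of mel-cepstral frames.

    Each frame is a sequence of coefficients [c0, c1, ..., cD].
    Assumption: c0 (energy) is skipped, as is common practice.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("sequences must be non-empty and have equal length")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance over c1..cD (c0 excluded).
        sq_dist = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(2.0 * sq_dist)
    return total / len(ref_frames)


# Toy frames, not data from the paper.
reference = [[0.0, 1.0, 0.5], [0.0, 0.8, 0.4]]
converted = [[0.0, 0.9, 0.5], [0.0, 0.8, 0.3]]
print(f"MCD: {mel_cepstral_distortion(reference, converted):.3f} dB")
```

Under this definition a lower value means the converted cepstra are closer to the reference, which is the direction of the 0.26 dB improvement reported against UUVC.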

Keywords: one-shot voice conversion; information perturbation; feature decoupling; speaker timbre leakage

Classification No.: TP391 [Automation and Computer Technology: Computer Application Technology]

 
