Authors: LI Yanping [1]; TAN Zhicheng; HU Chengyang; YANG Lulu; SHAO Xi [1] (School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003, China)
Affiliation: [1] School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003, China
Source: Journal of Signal Processing (《信号处理》), 2025, Issue 1, pp. 183-192 (10 pages)
Funding: National Science and Technology Innovation 2030 Major Project "New Generation Artificial Intelligence" (2020AAA0106200); National Natural Science Foundation of China (61936005, 62001038); Natural Science Foundation of Nanjing University of Posts and Telecommunications (NY223115)
Abstract: In cross-lingual voice conversion (CLVC), preserving the content information of the converted speech while effectively improving its speaker similarity and naturalness remains a challenging research problem. When a conventional encoder-decoder model is applied to cross-lingual voice conversion, it usually encodes content and speaker information independently of each other, so a certain amount of information leaks between the resulting content representation and speaker representation, and the speaker similarity of the converted speech is therefore unsatisfactory. To address this problem, this paper proposes a cross-lingual voice conversion method based on the Squeeze-and-Excitation (SE) attention mechanism and mutual information (MI), which achieves effective representation disentanglement and high-quality cross-lingual voice conversion in the open-set setting. First, the SE attention mechanism is introduced into the content encoder to exploit its ability to capture global information, so that the content encoder can extract content representations containing global contextual information. At the same time, mutual information terms are introduced between the individual representations and minimized, which greatly reduces the information leakage among them and thereby achieves effective representation disentanglement. Experimental results on the VCTK English corpus and the AISHELL-3 Chinese corpus show that the proposed SE-attention-and-mutual-information (SEMI) model has a stronger representation extraction ability: compared with the baseline model, the MCD is reduced by 10.89% in the objective evaluation, and the MOS and ABX scores are improved by 10.94% and 12.06%, respectively, in the subjective evaluation. These results confirm that the SEMI model makes significant progress in both converted-speech quality and speaker similarity and achieves high-quality cross-lingual voice conversion in the open-set setting.
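The two mechanisms named in the model, the SE attention block inside the content encoder and the MI penalty between representations, are standard building blocks. The sketches below (in PyTorch) are illustrative assumptions about how such components are commonly implemented; the hidden sizes, the reduction ratio, and the choice of MI estimator are not taken from the paper.

```python
# Minimal sketch of a Squeeze-and-Excitation (SE) block over a content-encoder
# feature map of shape (batch, channels, frames). Reduction ratio is an assumption.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)          # "squeeze": global average over time
        self.fc = nn.Sequential(                     # "excitation": per-channel gates
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1)
        return x * w                                 # reweight channels with global context
```

For the MI term, one widely used practical estimator is the CLUB variational upper bound (Cheng et al., 2020); whether the paper uses CLUB or a different estimator is not stated in the abstract, so the following is only one plausible realization.

```python
class CLUB(nn.Module):
    """Variational net q(speaker | content); its log-ratio gives an MI upper bound
    that the conversion model can minimize to suppress speaker leakage into the
    content representation (dimensions here are assumptions)."""
    def __init__(self, content_dim: int, speaker_dim: int, hidden: int = 256):
        super().__init__()
        self.p_mu = nn.Sequential(nn.Linear(content_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, speaker_dim))
        self.p_logvar = nn.Sequential(nn.Linear(content_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, speaker_dim), nn.Tanh())

    def loglikeli(self, content, speaker):
        # Train q itself by maximizing this log-likelihood on matched pairs.
        mu, logvar = self.p_mu(content), self.p_logvar(content)
        return (-(mu - speaker) ** 2 / logvar.exp() - logvar).sum(dim=1).mean()

    def mi_upper_bound(self, content, speaker):
        # CLUB estimate: matched pairs minus mismatched pairs within the batch.
        mu, logvar = self.p_mu(content), self.p_logvar(content)
        positive = -(mu - speaker) ** 2 / 2.0 / logvar.exp()
        negative = -((mu.unsqueeze(1) - speaker.unsqueeze(0)) ** 2).mean(dim=1) / 2.0 / logvar.exp()
        return (positive.sum(dim=-1) - negative.sum(dim=-1)).mean()
```

For reference, the MCD reported in the objective evaluation is conventionally computed from time-aligned mel-cepstra as below (alignment, e.g. by DTW, and exclusion of the 0th coefficient are assumed to have been done beforehand).

```python
import numpy as np

def mel_cepstral_distortion(ref: np.ndarray, conv: np.ndarray) -> float:
    # ref, conv: (frames, dims) aligned mel-cepstral coefficients; returns mean MCD in dB.
    diff = ref - conv
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))
```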
Keywords: cross-lingual voice conversion; SE attention mechanism; mutual information; global contextual information
Classification: TN912 [Electronics and Telecommunications - Communication and Information Systems]