Authors: WANG Lin; HUANG Hao [1] (School of Information Science and Engineering, Xinjiang University, Urumqi 830017, Xinjiang, China)
Institution: [1] School of Information Science and Engineering, Xinjiang University, Urumqi 830017, Xinjiang, China
Source: Computer Engineering (《计算机工程》), 2024, No. 4, pp. 313-320 (8 pages)
Fund: Open Project of the Key Laboratory of Xinjiang Uygur Autonomous Region (2020D04047)
Abstract: Pre-trained models have achieved significant breakthroughs in non-parallel corpus Voice Conversion (VC) through self-supervised learning of representations. With the widespread use of Self-Supervised Pre-trained Representation (SSPR), features extracted by pre-trained models have been shown to contain richer content information. This study proposes a VC model based on SSPR combined with Vector Quantization (VQ) and Connectionist Temporal Classification (CTC). The SSPR extracted by a pre-trained model is fed into an end-to-end model to improve the quality of one-shot VC. Effectively decoupling content and speaker representations is a key issue in VC. Using SSPR as the initial content information, VQ is applied to decouple content and speaker representations from speech. However, VQ alone only discretizes the content information and can hardly separate a pure content representation from speech. To further remove residual speaker information from the content representation, a CTC loss is proposed to guide the content encoder. CTC not only serves as an auxiliary network that accelerates model convergence, but its additional text supervision can also be jointly optimized with VQ, complementing it and learning a pure content representation. The speaker representation is obtained through style-embedding learning, and the two representations are used as inputs to the system for voice conversion. The proposed method is evaluated on the open-source CMU dataset and the VCTK corpus. Experimental results show that it achieves an objective Mel-Cepstral Distortion (MCD) of 8.896 dB, and subjective Mean Opinion Scores (MOS) of 3.29 for speech naturalness and 3.22 for speaker similarity, all better than the baseline model; the method thus achieves the best performance in terms of VC quality and speaker similarity.
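The joint VQ-plus-CTC objective described in the abstract can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the module names (VQLayer, ContentEncoder), the codebook size, the hidden dimensions, the assumed SSPR feature dimension of 768 (typical of wav2vec 2.0/HuBERT base models), and the phoneme vocabulary size are all illustrative assumptions.

```python
# Hypothetical sketch (not the paper's code): a content encoder that quantizes
# SSPR frames with a learned codebook (VQ) and adds CTC text supervision, so
# the two objectives can be optimized jointly as the abstract describes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQLayer(nn.Module):
    """Nearest-neighbour vector quantizer with a straight-through estimator."""
    def __init__(self, num_codes=256, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight (assumed value)

    def forward(self, z):                        # z: (B, T, D)
        flat = z.reshape(-1, z.size(-1))         # (B*T, D)
        # Squared Euclidean distance to every codebook entry
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)
        q = self.codebook(idx).view_as(z)        # discretized content tokens
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()                 # straight-through gradient
        return q, vq_loss

class ContentEncoder(nn.Module):
    """Maps SSPR features to discrete content codes; a CTC head adds text supervision."""
    def __init__(self, sspr_dim=768, hidden=256, vocab_size=40):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(sspr_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden))
        self.vq = VQLayer(num_codes=256, dim=hidden)
        self.ctc_head = nn.Linear(hidden, vocab_size)   # e.g. phonemes + blank
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, sspr, text, in_lens, txt_lens):
        z = self.proj(sspr)                      # (B, T, H)
        content, vq_loss = self.vq(z)
        log_probs = self.ctc_head(content).log_softmax(-1).transpose(0, 1)  # (T, B, V)
        ctc = self.ctc_loss(log_probs, text, in_lens, txt_lens)
        return content, vq_loss + ctc            # joint VQ + CTC objective
```

In this sketch, the commitment term keeps the encoder output close to its codebook entry, while the CTC term ties the quantized codes to the transcript; jointly minimizing both pushes speaker-dependent detail out of the content path, which is the complementary effect the abstract attributes to combining VQ with CTC.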
Keywords: pre-trained representation; self-supervised learning; vector quantization; decoupling; connectionist temporal classification
CLC Number: TP391 [Automation and Computer Technology - Computer Application Technology]