Authors: WANG Lin; HUANG Hao [1] (School of Information Science and Engineering, Xinjiang University, Urumqi 830017, Xinjiang, China)
Institution: [1] School of Information Science and Engineering, Xinjiang University, Urumqi 830017, Xinjiang, China
Source: Computer Engineering (《计算机工程》), 2024, No. 4, pp. 313-320 (8 pages)
Fund: Open Project of the Key Laboratory of Xinjiang Uygur Autonomous Region (2020D04047)
Abstract: Pre-trained models have achieved significant breakthroughs in non-parallel corpus Voice Conversion (VC) through self-supervised learning of representations. With the widespread use of Self-Supervised Pre-trained Representation (SSPR), features extracted by pre-trained models have been shown to contain richer content information. This study proposes a VC model based on SSPR combined with Vector Quantization (VQ) and Connectionist Temporal Classification (CTC). The SSPR extracted by a pre-trained model is fed into an end-to-end model to improve the quality of one-shot VC. Effectively decoupling content and speaker representations is a key issue in VC. Using SSPR as the initial content information, VQ is applied to decouple content and speaker representations from speech. However, VQ alone only discretizes the content information and can hardly separate a pure content representation from speech. To further remove residual speaker information from the content representation, a CTC loss is proposed to guide the content encoder. CTC not only serves as an auxiliary network that accelerates model convergence, but its additional text supervision can also be jointly optimized with VQ, complementing it and learning a pure content representation. The speaker representation is obtained through style-embedding learning, and the two representations are used as inputs to the system for voice conversion. The proposed method is evaluated on the open-source CMU dataset and the VCTK corpus. Experimental results show that it achieves an objective Mel-Cepstral Distortion (MCD) of 8.896 dB, and subjective Mean Opinion Scores (MOS) of 3.29 for speech naturalness and 3.22 for speaker similarity, all better than the baseline model; the method thus achieves the best performance in terms of VC quality and speaker similarity.
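The joint VQ-plus-CTC objective described in the abstract can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the module names (VQLayer, ContentEncoder), the codebook size, the hidden dimensions, the assumed SSPR feature dimension of 768 (typical of wav2vec 2.0/HuBERT base models), and the phoneme vocabulary size are all illustrative assumptions.

```python
# Hypothetical sketch (not the paper's code): a content encoder that quantizes
# SSPR frames with a learned codebook (VQ) and adds CTC text supervision, so
# the two objectives can be optimized jointly as the abstract describes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQLayer(nn.Module):
    """Nearest-neighbour vector quantizer with a straight-through estimator."""
    def __init__(self, num_codes=256, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight (assumed value)

    def forward(self, z):                        # z: (B, T, D)
        flat = z.reshape(-1, z.size(-1))         # (B*T, D)
        # Squared Euclidean distance to every codebook entry
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)
        q = self.codebook(idx).view_as(z)        # discretized content tokens
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()                 # straight-through gradient
        return q, vq_loss

class ContentEncoder(nn.Module):
    """Maps SSPR features to discrete content codes; a CTC head adds text supervision."""
    def __init__(self, sspr_dim=768, hidden=256, vocab_size=40):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(sspr_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden))
        self.vq = VQLayer(num_codes=256, dim=hidden)
        self.ctc_head = nn.Linear(hidden, vocab_size)   # e.g. phonemes + blank
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, sspr, text, in_lens, txt_lens):
        z = self.proj(sspr)                      # (B, T, H)
        content, vq_loss = self.vq(z)
        log_probs = self.ctc_head(content).log_softmax(-1).transpose(0, 1)  # (T, B, V)
        ctc = self.ctc_loss(log_probs, text, in_lens, txt_lens)
        return content, vq_loss + ctc            # joint VQ + CTC objective
```

In this sketch, the commitment term keeps the encoder output close to its codebook entry, while the CTC term ties the quantized codes to the transcript; jointly minimizing both pushes speaker-dependent detail out of the content path, which is the complementary effect the abstract attributes to combining VQ with CTC.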
Keywords: pre-trained representation; self-supervised learning; vector quantization; decoupling; connectionist temporal classification
CLC Number: TP391 [Automation and Computer Technology - Computer Application Technology]