Speaker-emotion voice conversion method with limited corpus based on large language model and pre-trained model

Authors: LU Chaofeng; TAO Ye; WEN Lianqing; MENG Fei; QIN Xiugong; DU Yongjie; TIAN Yunlong

Affiliations: [1] School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, Shandong 266061, China; [2] School of Information Science and Engineering, Linyi University, Linyi, Shandong 276000, China; [3] Beijing Research Institute of Automation for Machinery Industry Co., Ltd., Beijing 100120, China; [4] Digital Home Network National Engineering Laboratory, Qingdao, Shandong 266000, China

Source: Journal of Computer Applications, 2025, No. 3, pp. 815-822 (8 pages)

Funding: National Key R&D Program of China (2023YFF0612100); Qingdao Key Technology Research and Industrialization Demonstration Project (24-1-2-qljh-19-gx).

Abstract: Research combining speaker conversion with emotional voice conversion is scarce, and in real scenarios the emotional corpus of a target speaker is usually too small to train a strongly generalizing model from scratch. To address these problems, a Speaker-Emotion Voice Conversion with Limited corpus (LSEVC) method was proposed that fuses a large language model with a pre-trained emotional speech synthesis model. Firstly, a large language model was used to generate text with the required emotion tags. Secondly, the pre-trained emotional speech synthesis model was fine-tuned on the target speaker's corpus to embed the target speaker. Thirdly, emotional speech was synthesized from the generated text for data augmentation. Fourthly, the synthesized speech and the source and target speech were used to jointly train the speaker-emotion voice conversion model. Finally, to further enhance the speaker similarity and emotional similarity of the converted speech, the model was fine-tuned on the source and target speakers' emotional speech. Experiments were conducted on publicly available corpora and a Chinese fiction corpus. The results show that, when jointly considering the evaluation metrics of Emotional similarity Mean Opinion Score (EMOS), Speaker similarity Mean Opinion Score (SMOS), Mel Cepstral Distortion (MCD), and Word Error Rate (WER), the proposed method outperforms CycleGAN-EVC, Seq2Seq-EVC-WA2, SMAL-ET2, and other methods.
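The five steps of the abstract can be sketched as a data-flow pipeline. The sketch below is purely illustrative: every function and data structure is a placeholder standing in for a real component (the LLM, the pre-trained emotional TTS model, the conversion model), and none of the names come from the paper itself.

```python
# Illustrative sketch of the LSEVC pipeline from the abstract.
# All components are stubs; real training code would replace each function body.

def generate_tagged_text(llm, emotions, n_per_emotion):
    """Step 1: the LLM generates text carrying the required emotion tags."""
    return [(emo, llm(emo)) for emo in emotions for _ in range(n_per_emotion)]

def finetune_tts(tts, target_corpus):
    """Step 2: fine-tune the pre-trained emotional TTS to embed the target speaker."""
    tts["speaker"] = target_corpus["speaker"]
    return tts

def synthesize(tts, tagged_texts):
    """Step 3: synthesize emotional speech from the generated text (data augmentation)."""
    return [{"speaker": tts["speaker"], "emotion": e, "text": t} for e, t in tagged_texts]

def train_conversion_model(synthetic_speech, real_speech):
    """Step 4: jointly train the speaker-emotion conversion model on both corpora."""
    return {"train_size": len(synthetic_speech) + len(real_speech)}

def finetune_conversion_model(model, emotional_speech):
    """Step 5: fine-tune on the speakers' emotional speech to raise
    speaker/emotion similarity of the converted speech."""
    model["finetuned_on"] = len(emotional_speech)
    return model

# Toy usage with stub components.
stub_llm = lambda emo: f"A sentence expressing {emo}."
texts = generate_tagged_text(stub_llm, ["happy", "sad", "angry"], n_per_emotion=2)
tts = finetune_tts({"speaker": None}, {"speaker": "target", "utterances": []})
augmented = synthesize(tts, texts)
model = train_conversion_model(augmented, real_speech=[{"id": i} for i in range(5)])
model = finetune_conversion_model(model, emotional_speech=augmented[:3])
```

The key design point the sketch captures is that the synthesized speech from steps 1-3 only augments, and does not replace, the limited real corpus used in steps 4-5.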

Keywords: limited corpus; speaker-emotion voice conversion; large language model; pre-trained emotional speech synthesis model; fine-tuning

Classification: TN912.3 (Electronics and Telecommunications: Communication and Information Systems)

 
