Multi-scale Expressive Chinese Speech Synthesis

Cited by: 1

Authors: GAO Jie [1]; XIAO Dajun; XU Xialing; LIU Shaohan [1]; YANG Qun [1]

Affiliations: [1] College of Computer Science and Technology, Nanjing University of Aeronautics & Astronautics, Nanjing 211106, China; [2] Central China Branch of State Grid Corporation of China, Wuhan 430070, China

Source: Journal of Data Acquisition and Processing, 2023, No. 6, pp. 1458-1468 (11 pages)

Abstract: Common methods for enhancing the expressiveness of synthesized speech encode the reference audio as a fixed-dimensional prosody embedding, which is fed into the decoder of the speech synthesis model together with the text information, thereby introducing varying prosody information into the model. However, this approach extracts prosody only at the global (whole-utterance) level and ignores fine-grained prosody at the character or phoneme level, so the synthesized speech still exhibits unnatural pronunciation of some words and flat pitch and speaking rate. To address these problems, this paper proposes a multi-scale expressive Chinese speech synthesis method based on the Tacotron2 speech synthesis model. The method uses a multi-scale prosody encoding network based on variational auto-encoders to extract global-level prosody information and phoneme-level pitch information from the reference audio, and feeds both into the decoder of the speech synthesis model together with the text information. In addition, during training, the mutual information between the prosody embedding and the pitch embedding is minimized in order to remove the correlation between the different feature representations and separate them. Experimental results show that, compared with the single-scale expressive speech synthesis method, the proposed method improves the subjective mean opinion score by about 2% and reduces the F0 frame error rate by about 14%, and that it generates more natural and expressive speech.
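The abstract reports the F0 frame error rate without defining it. As a rough illustration only, the sketch below follows the commonly used definition: the fraction of frames that have either a voicing decision error or a pitch deviation of more than 20% from the reference. The function name `f0_frame_error`, the 20% tolerance, and the convention that an F0 value of 0 marks an unvoiced frame are assumptions of this sketch, not details taken from the paper. It also assumes the two F0 tracks are already frame-aligned.

```python
from typing import Sequence


def f0_frame_error(ref_f0: Sequence[float], syn_f0: Sequence[float],
                   tolerance: float = 0.2) -> float:
    """Return the fraction of frames with a voicing or gross pitch error.

    Assumption of this sketch: a frame with F0 == 0 is unvoiced.
    """
    if len(ref_f0) != len(syn_f0):
        raise ValueError("F0 tracks must be frame-aligned and equally long")
    if not ref_f0:
        raise ValueError("F0 tracks must not be empty")

    errors = 0
    for ref, syn in zip(ref_f0, syn_f0):
        ref_voiced, syn_voiced = ref > 0, syn > 0
        if ref_voiced != syn_voiced:
            errors += 1  # voicing decision error
        elif ref_voiced and abs(syn - ref) > tolerance * ref:
            errors += 1  # gross pitch error: deviation above the tolerance
    return errors / len(ref_f0)


# Toy F0 tracks in Hz (hypothetical values, one entry per frame).
reference = [0.0, 200.0, 210.0, 220.0, 0.0]
synthesized = [0.0, 205.0, 300.0, 0.0, 0.0]
ffe = f0_frame_error(reference, synthesized)
print(f"FFE = {ffe:.2f}")  # prints "FFE = 0.40"
```

In the toy example, frame 2 is a gross pitch error and frame 3 is a voicing error, so two of the five frames count as errors. In practice the synthesized and reference tracks usually differ in length and would need to be aligned before such a comparison.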

Keywords: speech synthesis; neural network; variational auto-encoder; attention mechanism; prosody enhancement

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

