Authors: GAO Jie (高洁)[1], XIAO Dajun (肖大军), XU Xialing (徐遐龄), LIU Shaohan (刘绍翰)[1], YANG Qun (杨群)[1]
Affiliations: [1] College of Computer Science and Technology, Nanjing University of Aeronautics & Astronautics, Nanjing 211106, China; [2] Central China Branch of State Grid Corporation of China, Wuhan 430070, China
Source: Journal of Data Acquisition and Processing (《数据采集与处理》), 2023, No. 6, pp. 1458-1468 (11 pages)
Abstract: Common methods for enhancing the expressiveness of synthesized speech encode the reference audio as a fixed-dimensional prosody embedding and feed it, together with the text embedding, into the decoder of the speech synthesis model, thereby introducing varying prosody information into synthesis. However, this approach captures prosody only at the global (utterance) level and neglects fine-grained prosody at the word or phoneme level. As a result, the synthesized speech can still pronounce some words unnaturally and sound flat in pitch and speaking rate. To address these issues, this paper proposes a multi-scale expressive Chinese speech synthesis method based on Tacotron2. A multi-scale prosody encoding network built on two variational autoencoders extracts global-level prosody information and phoneme-level pitch information from the reference audio, and both are fed into the decoder of the speech synthesis model together with the text information. In addition, during training the mutual information between the prosody embedding and the pitch embedding is minimized, which removes the correlation between the two feature representations and separates them. Experimental results show that, compared with a single-scale expressive speech synthesis method, the proposed method raises the subjective mean opinion score by about 2% and lowers the F0 frame error rate by about 14%, generating speech that is more natural and expressive.
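The abstract reports an F0 frame error rate without defining it. As the metric is commonly defined, it is the fraction of frames that contain either a voicing decision error or a gross pitch error (pitch off by more than 20% when both frames are voiced). The following is a minimal Python sketch under that common definition, not the authors' evaluation code: the function name `f0_frame_error`, the `tol` parameter, and the convention that 0 Hz marks an unvoiced frame are assumptions for illustration, and the two F0 tracks are assumed to be already frame-aligned.

```python
def f0_frame_error(ref_f0, syn_f0, tol=0.2):
    """Return the fraction of frames with a voicing or gross pitch error.

    ref_f0, syn_f0: frame-aligned F0 values in Hz; 0 marks an unvoiced
    frame (an assumption of this sketch, not taken from the paper).
    tol: relative pitch deviation counted as a gross pitch error.
    """
    if len(ref_f0) != len(syn_f0):
        raise ValueError("F0 tracks must be frame-aligned and equal in length")
    if not ref_f0:
        raise ValueError("F0 tracks must not be empty")

    errors = 0
    for ref, syn in zip(ref_f0, syn_f0):
        ref_voiced, syn_voiced = ref > 0, syn > 0
        if ref_voiced != syn_voiced:
            errors += 1  # voicing decision error
        elif ref_voiced and abs(syn - ref) > tol * ref:
            errors += 1  # gross pitch error (both frames voiced)
    return errors / len(ref_f0)


# Hypothetical values: frame 3 has a voicing error, frame 4 a gross pitch error.
reference = [100.0, 0.0, 200.0, 150.0]
synthesized = [105.0, 0.0, 0.0, 200.0]
print(f0_frame_error(reference, synthesized))
```

In practice the synthesized and reference utterances usually differ in length, so an alignment step would be needed before this comparison; how the paper handles that is not stated in the abstract.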
Keywords: speech synthesis; neural network; variational autoencoder; attention mechanism; prosody enhancement
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]