Authors: XU Xiao-na [1,2]; LI Ning; ZHAO Yue
Affiliations: [1] College of Information Engineering, Minzu University of China, Beijing 100081, China; [2] Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China, Beijing 100081, China
Source: Computer Simulation (《计算机仿真》), 2025, No. 3, pp. 283-288, 298 (7 pages)
Funding: National Natural Science Foundation of China (Grant No. 61976236).
Abstract: The Tacotron model has achieved good results in Tibetan end-to-end speech synthesis; however, as a model based on recurrent neural networks (RNN), it suffers from low training and prediction efficiency and from long-range information loss. To further improve Tibetan speech synthesis, an end-to-end model based on the Transformer is proposed to synthesize speech for multiple Tibetan dialects. The model uses the multi-head attention mechanism to build the hidden states of the encoder and decoder in parallel, which effectively addresses the problem of modeling long-distance dependencies and allows the model to take advantage of multi-GPU parallel training. Three different synthesis units (Tibetan characters, Latin letters, and Tibetan components) are selected as input to the acoustic model; a Transformer Text-To-Speech (TTS) network generates mel spectrograms, and a trained WaveNet then converts the mel spectrograms into the final speech waveform. A series of comparative experiments is conducted: Tacotron is compared with the Transformer-based end-to-end model on Tibetan multi-dialect speech synthesis, the three synthesis units are compared on the proposed model, and single-GPU training is compared with multi-GPU parallel training. The results show that the Transformer-based end-to-end model outperforms Tacotron when applied to Tibetan multi-dialect speech synthesis, and that speech obtained using Latin letters as the synthesis unit with multi-GPU parallel training has better clarity and naturalness.
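The abstract's key architectural point, that multi-head attention computes all hidden states of a sequence in parallel rather than step-by-step as an RNN does, can be sketched as follows. This is a minimal NumPy illustration of multi-head self-attention, not the paper's implementation; all dimensions, weights, and the head count are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) projections."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project and split into heads: (num_heads, seq_len, d_head)
    Q = (X @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention over ALL positions at once -- this is
    # where the parallelism over the sequence comes from.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    attn = softmax(scores, axis=-1)
    heads = attn @ V                                      # (heads, seq, d_head)
    # Concatenate heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 64, 10, 4
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)  # (10, 64)
```

Because every position attends to every other position in one matrix product, there is no sequential dependency along the time axis, which is what makes the multi-GPU parallel training discussed in the abstract practical.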
Classification: TP39 [Automation and Computer Technology: Computer Application Technology]