Authors: XU Xiao-na [1,2]; LI Ning; ZHAO Yue
Affiliations: [1] College of Information Engineering, Minzu University of China, Beijing 100081, China; [2] Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China, Beijing 100081, China
Source: Computer Simulation (《计算机仿真》), 2025, No. 3, pp. 283-288, 298 (7 pages)
Funding: National Natural Science Foundation of China (Grant No. 61976236).
Abstract: The Tacotron model has achieved good results in Tibetan end-to-end speech synthesis; however, as a model based on recurrent neural networks (RNN), it suffers from low training and prediction efficiency and from long-range information loss. To further improve Tibetan speech synthesis, an end-to-end model based on the Transformer is proposed to synthesize speech for multiple Tibetan dialects. The model uses the multi-head attention mechanism to build the hidden states of the encoder and decoder in parallel, which effectively addresses the problem of modeling long-distance dependencies and allows the model to take advantage of multi-GPU parallel training. Three different synthesis units (Tibetan characters, Latin letters, and Tibetan components) are selected as input to the acoustic model; a Transformer Text-To-Speech (TTS) network generates mel spectrograms, and a trained WaveNet then converts the mel spectrograms into the final speech waveform. A series of comparative experiments is conducted: Tacotron is compared with the Transformer-based end-to-end model on Tibetan multi-dialect speech synthesis, the three synthesis units are compared on the proposed model, and single-GPU training is compared with multi-GPU parallel training. The results show that the Transformer-based end-to-end model outperforms Tacotron when applied to Tibetan multi-dialect speech synthesis, and that speech obtained using Latin letters as the synthesis unit with multi-GPU parallel training has better clarity and naturalness.
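The abstract's key architectural point, that multi-head attention computes all hidden states of a sequence in parallel rather than step-by-step as an RNN does, can be sketched as follows. This is a minimal NumPy illustration of multi-head self-attention, not the paper's implementation; all dimensions, weights, and the head count are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) projections."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project and split into heads: (num_heads, seq_len, d_head)
    Q = (X @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention over ALL positions at once -- this is
    # where the parallelism over the sequence comes from.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    attn = softmax(scores, axis=-1)
    heads = attn @ V                                      # (heads, seq, d_head)
    # Concatenate heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 64, 10, 4
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)  # (10, 64)
```

Because every position attends to every other position in one matrix product, there is no sequential dependency along the time axis, which is what makes the multi-GPU parallel training discussed in the abstract practical.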
Classification: TP39 [Automation and Computer Technology: Computer Application Technology]