End-to-End Speech Synthesis Based on BERT

Cited by: 10

Authors: AN Xin[1,2]; DAI Zi-biao; LI Yang; SUN Xiao[1,2]; REN Fu-ji[1,2]

Affiliations: [1] School of Computer and Information, Hefei University of Technology, Hefei 230601, China; [2] Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, Hefei University of Technology, Hefei 230601, China

Source: Computer Science (计算机科学), 2022, No. 4, pp. 221-226 (6 pages)

Funding: Joint Fund of the National Natural Science Foundation of China (U1613217); Key Research and Development Program of Anhui Province (202004d07020004); Fundamental Research Funds for the Central Universities (JZ2020YYPY0092).

Abstract: To address the low training and inference efficiency and the long-distance information loss of RNN-based neural speech synthesis models, an end-to-end speech synthesis method based on BERT is proposed, which replaces the RNN encoder in the Seq2Seq speech synthesis architecture with a self-attention mechanism. The method uses a pre-trained BERT as the model's encoder to extract contextual information from the input text; the decoder adopts the same architecture as the Tacotron2 speech synthesis model and outputs a mel spectrogram; finally, a trained WaveGlow network converts the mel spectrogram into the final audio. By fine-tuning the pre-trained BERT to adapt it to the downstream task, the method greatly reduces the number of trainable parameters and the training time. At the same time, the self-attention mechanism allows the hidden states in the encoder to be computed in parallel, making full use of the GPU's parallel computing power to improve training efficiency and effectively alleviating the long-range dependency problem. Comparison experiments with Tacotron2 show that the proposed model achieves results close to those of Tacotron2 while roughly doubling the training speed.
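Since the abstract only sketches the architecture, a minimal PyTorch illustration may help. The sketch below (assuming the Hugging Face transformers library) shows the core idea: a pre-trained BERT replaces the RNN encoder, so its self-attention layers encode the whole input sequence in parallel, and a decoder maps the contextual embeddings to an 80-band mel spectrogram. The BertTTS class name, the single linear layer standing in for the Tacotron2-style autoregressive decoder, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: BERT as a parallel TTS encoder.
# The linear "decoder" is a placeholder for the paper's Tacotron2-style
# autoregressive attention decoder.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertTTS(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, bert_name="bert-base-chinese", n_mels=80):
        super().__init__()
        # Pre-trained BERT encoder: self-attention computes all hidden
        # states at once, unlike a step-by-step RNN encoder.
        self.encoder = BertModel.from_pretrained(bert_name)
        # Placeholder decoder: projects each contextual embedding to one
        # mel frame; Tacotron2's real decoder is autoregressive.
        self.decoder = nn.Linear(self.encoder.config.hidden_size, n_mels)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden) contextual embeddings, computed in parallel
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # (batch, n_mels, seq_len): the mel layout Tacotron2/WaveGlow expect
        return self.decoder(hidden).transpose(1, 2)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertTTS()
batch = tokenizer("基于BERT的端到端语音合成", return_tensors="pt")
mel = model(batch["input_ids"], batch["attention_mask"])
print(mel.shape)  # (1, 80, number_of_tokens)
```

Because the encoder starts from pre-trained weights, training only needs to fine-tune it alongside the decoder (typically with a small learning rate for the BERT layers), which is how the paper reduces trainable parameters and training time relative to training an encoder from scratch.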
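The last stage the abstract mentions, converting the mel spectrogram to audio with a trained WaveGlow, can be sketched with NVIDIA's published torch.hub checkpoint. The entry-point name follows NVIDIA's own hub example and is an assumption about deployment, not necessarily the checkpoint the authors used; with the untrained placeholder decoder above, the output would of course be noise.

```python
# Hedged sketch: mel spectrogram -> waveform via a pre-trained WaveGlow.
# Loads NVIDIA's published checkpoint from torch.hub (entry point per
# NVIDIA's hub example; not necessarily the model trained in the paper).
import torch

waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                          "nvidia_waveglow")
waveglow = waveglow.remove_weightnorm(waveglow).eval()

with torch.no_grad():
    # mel: (batch, 80, frames) -> audio: (batch, samples)
    audio = waveglow.infer(mel)
```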

Keywords: speech synthesis; recurrent neural network; Seq2Seq; WaveGlow; attention mechanism

CLC Number: TP391 (Automation and Computer Technology: Computer Application Technology)

 
