Multi-Speaker Indonesian Speech Synthesis Based on Global Style Embedding


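Authors: Yang Yiling (杨益灵), Yang Jian (杨鉴) [1], Wang Faliang (王发亮)

Affiliation: [1] School of Information, Yunnan University, Kunming, Yunnan

Source: Computer Science and Application (《计算机科学与应用》), 2023, No. 1, pp. 126-135 (10 pages)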

Abstract: High-quality Indonesian speech corpora are scarce, so the performance of multi-speaker speech synthesis for the language still has room to improve. To mitigate the impact of low-resource conditions on multi-speaker synthesis, this work studies and implements an end-to-end Indonesian speech synthesis system based on the GST-Tacotron2 framework. A system trained on 8.5 hours of single-speaker Indonesian data achieves a mean opinion score (MOS) of 4.11 for synthesized speech. Building on this, a multi-speaker Indonesian synthesis system is designed, with a focus on how speaker encoding affects the naturalness of multi-speaker synthesis when only a small amount of speech from other Indonesian speakers is mixed into training. Experiments show that with a model trained on a total of 14.5 hours of multi-speaker speech, the main speaker's synthesized speech reaches a MOS of 4.12, and its mel-cepstral distortion (MCD) is 7.2% lower than that of the best single-speaker model. The synthesized speech of the other speakers all scores above 3.60 in MOS, which verifies the effectiveness of the proposed method.
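The abstract reports an MCD reduction without saying how MCD was computed. As a rough illustration only, the sketch below uses the commonly cited frame-level definition (in dB, with the 0th cepstral coefficient excluded) and assumes the two sequences are already time-aligned. The function names are hypothetical and none of this is taken from the paper's implementation.

```python
import math


def frame_mcd(ref, syn):
    """MCD in dB for one pair of mel-cepstral vectors.

    Assumption: index 0 holds the energy term and is skipped,
    as is common practice; the paper does not specify this.
    """
    squared_diff = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * squared_diff)


def mean_mcd(ref_frames, syn_frames):
    """Average frame-level MCD over two time-aligned sequences."""
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("sequences must be non-empty and time-aligned")
    total = sum(frame_mcd(r, s) for r, s in zip(ref_frames, syn_frames))
    return total / len(ref_frames)


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline
```

In practice the reference and synthesized utterances differ in length, so an alignment step such as dynamic time warping usually precedes `mean_mcd`; that step is omitted here.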

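Keywords: speech synthesis; multi-speaker; style transfer; low-resource; Indonesian

Classification number: TN912.33 [Electronics and Telecommunications: Communication and Information Systems]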

 
