Improving the performance of speech waveform synthesis using WaveNet fused with phase information  

Authors: ZHENG Changyan, YANG Jibin, ZHANG Xiongwei, SUN Meng

Affiliations: [1] High-tech Institute, Fan Gong-ting South Street on the 12th, Qingzhou 262500; [2] Army Engineering University, Nanjing 210007

Source: Chinese Journal of Acoustics (声学学报(英文版)), 2022, No. 1, pp. 1-19 (19 pages)

Funding: Supported by the National Natural Science Foundation of China (62071484, 61471394) and the NSF of Jiangsu Province for Excellent Young Scholars (BK20180080).

Abstract: Compared with the phase spectrum, the magnitude spectrum carries most of the information in speech. Many speech processing tasks therefore concentrate on manipulating the magnitude spectrum and synthesize the waveform from imperfect vocoder parameters or a mismatched phase spectrum, which noticeably degrades speech quality. To address this problem, a modified WaveNet model fused with phase information is proposed to synthesize higher-quality speech. In this model, the original or processed phase spectrum of the speech is concatenated with the enhanced magnitude spectrum to form a fused feature that serves as the condition input, and the speech waveform is generated directly from this input. The proposed method makes effective use of phase information and is verified on two tasks: voice conversion (VC) and bone-conducted speech enhancement (BSE). Two kinds of phase spectrum, the modified group delay (MGD) spectrum and the instantaneous frequency deviation spectrum, are compared comprehensively in simulation experiments, and the influence of the fused feature on the bandwidth extension WaveNet model and the teacher-student WaveNet model is also explored. In the VC experiments, an A/B test shows that speech generated with the teacher-student WaveNet model is much better than speech generated with the STRAIGHT vocoder. In the BSE experiments, using the bandwidth extension WaveNet model with the feature fused with the MGD spectrum, the mean opinion score (MOS) of the enhanced speech is 54.3% higher than that of the original bone-conducted speech. All results demonstrate that the phase-fused condition input effectively supplements the magnitude spectrum alone and helps the WaveNet vocoder achieve a promising improvement in the quality of the synthesized speech.
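
The fused condition input described in the abstract can be illustrated with a minimal sketch. It is not the authors' feature extraction. It assumes a single frame, a naive DFT, and a simplified group-delay feature that omits the cepstral smoothing normally used for the MGD spectrum. The function names and the default values of `alpha`, `gamma` and `eps` are hypothetical choices made for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) DFT; returns the first N//2 + 1 bins of a real frame."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def fused_condition_frame(frame, alpha=0.4, gamma=0.9, eps=1e-8):
    """Concatenate log-magnitude bins with compressed group-delay bins.

    Simplified stand-in for a magnitude + MGD fusion feature (assumption):
    the denominator uses |X|^(2*gamma) directly instead of a cepstrally
    smoothed spectrum.
    """
    X = dft(frame)
    # DFT of n * x[n], which gives the group delay without phase unwrapping.
    Y = dft([t * s for t, s in enumerate(frame)])

    log_mag = [math.log(abs(xk) + eps) for xk in X]

    phase_feat = []
    for xk, yk in zip(X, Y):
        num = xk.real * yk.real + xk.imag * yk.imag
        tau = num / (abs(xk) ** (2 * gamma) + eps)
        # Sign-preserving power compression of the group delay.
        phase_feat.append(math.copysign(abs(tau) ** alpha, tau))

    # Fusion by concatenation along the feature axis.
    return log_mag + phase_feat
```

For a unit impulse delayed by three samples, with `alpha=1.0` and `gamma=1.0`, the log-magnitude bins are approximately 0 and the group-delay bins are approximately 3, which matches the delay of the impulse. In a conditioned vocoder, a vector like this would be computed for each frame and upsampled to the waveform rate before being used as the condition input.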

Keywords: FUSED PHASE WAVE

Classification code: TN912.3 [Electronics and Telecommunications: Communication and Information Systems]

 
