Speech enhancement algorithm based on multi-scale ladder-type time-frequency Conformer GAN  (cited by: 4)


Authors: JIN Yutang [1]; WANG Yisong [1]; WANG Lihui; ZHAO Pengli

Affiliations: [1] State Key Laboratory of Public Big Data (Guizhou University), Guiyang, Guizhou 550025, China; [2] Xuchang Electric Vocational College, Xuchang, Henan 461000, China

Source: Journal of Computer Applications, 2023, No. 11, pp. 3607-3615 (9 pages)

Funding: National Natural Science Foundation of China (U1836205).

Abstract: Frequency-domain speech enhancement algorithms suffer from artificial artifacts caused by phase disorder, which limit denoising performance and degrade speech quality. To address this problem, a speech enhancement algorithm based on a Multi-Scale Ladder-type Time-Frequency Conformer Generative Adversarial Network (MSLTF-CMGAN) is proposed. Taking the real part, imaginary part and magnitude spectrum of the speech spectrogram as input, the generator first uses time-frequency Conformers at multiple scales to learn global and local feature dependencies in the time and frequency domains. Second, the Mask Decoder branch learns the magnitude mask, while the Complex Decoder branch directly learns the clean spectrogram; the outputs of the two decoder branches are fused to obtain the reconstructed speech. Finally, a metric discriminator judges the evaluation-metric scores of the speech, and minimax training drives the generator to produce high-quality speech. Comparison experiments with various speech enhancement models were conducted on the public dataset VoiceBank+Demand using subjective Mean Opinion Score (MOS) evaluation and objective evaluation metrics. The results show that, compared with the current state-of-the-art speech enhancement method CMGAN (Conformer-based MetricGAN), MSLTF-CMGAN improves the MOS prediction of signal distortion (CSIG) and the MOS prediction of background-noise intrusiveness (CBAK) by 0.04 and 0.07 respectively. Although its Perceptual Evaluation of Speech Quality (PESQ) and MOS prediction of the overall effect (COVL) are slightly lower than those of CMGAN, it still outperforms the other comparison models on several subjective and objective speech quality metrics.
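The two-branch decoding described in the abstract can be illustrated with a small sketch. The abstract does not give the exact fusion formula, so the rule below is an assumption: the mask scales the noisy magnitude, the noisy phase is kept, and the Complex Decoder output is added as a complex-valued correction. The function names `make_generator_input` and `fuse_decoder_outputs` are hypothetical, and the spectrogram is represented as a plain nested list of complex numbers rather than a tensor.

```python
import cmath


def make_generator_input(spec):
    """Split a complex spectrogram (list of rows of complex bins) into the
    three input channels named in the abstract: real, imaginary, magnitude."""
    real = [[z.real for z in row] for row in spec]
    imag = [[z.imag for z in row] for row in spec]
    mag = [[abs(z) for z in row] for row in spec]
    return real, imag, mag


def fuse_decoder_outputs(noisy_spec, mask, complex_out):
    """Assumed fusion rule (not taken from the paper): apply the magnitude
    mask to the noisy magnitude, reuse the noisy phase, then add the
    Complex Decoder output bin by bin."""
    fused = []
    for spec_row, mask_row, comp_row in zip(noisy_spec, mask, complex_out):
        row = []
        for z, m, c in zip(spec_row, mask_row, comp_row):
            # Masked magnitude combined with the unchanged noisy phase.
            masked = cmath.rect(m * abs(z), cmath.phase(z))
            # Complex branch contributes an additive correction.
            row.append(masked + c)
        fused.append(row)
    return fused


if __name__ == "__main__":
    noisy = [[3 + 4j, 1 + 0j]]          # one frame, two frequency bins
    mask = [[0.5, 1.0]]                 # stand-in Mask Decoder output
    correction = [[0j, 0.5j]]           # stand-in Complex Decoder output
    print(make_generator_input(noisy))
    print(fuse_decoder_outputs(noisy, mask, correction))
```

In a real model the mask and the complex correction would come from the two trained decoder branches, and the fused spectrogram would be passed through an inverse STFT to obtain the waveform.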

Keywords: speech enhancement; multi-scale; Conformer; generative adversarial network; metric discriminator; deep learning

Classification number: TP391.9 [Automation and Computer Technology: Computer Application Technology]

 
