Monaural speech enhancement using U-net fused with multi-head self-attention  Cited by: 12


Authors: FAN Junyi (范君怡); YANG Jibin (杨吉斌); ZHANG Xiongwei (张雄伟); ZHENG Changyan (郑昌艳) (Graduate School, Army Engineering University, Nanjing 210007; College of Command and Control Engineering, Army Engineering University, Nanjing 210007; Department of Test Control, High-tech Institute, Qingzhou 262500)

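Affiliations: [1] Graduate School, Army Engineering University, Nanjing 210007 [2] College of Command and Control Engineering, Army Engineering University, Nanjing 210007 [3] Department of Test Control, High-tech Institute (火箭军士官学校), Qingzhou 262500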

Source: Acta Acustica (《声学学报》), 2022, No. 6, pp. 703-716 (14 pages)

Funding: Supported by the National Natural Science Foundation of China (62071484).

Abstract: Existing deep learning models for monaural speech enhancement perform poorly under low Signal-to-Noise Ratio (SNR) and burst background noise. Humans, by contrast, exploit the long-term correlation of speech to form an integrated perception of different speech signals, so modeling the long-term dependencies of speech can help improve enhancement under these conditions. Inspired by this, the paper proposes TU-net, a time-domain, end-to-end monaural speech enhancement model that fuses the multi-head self-attention mechanism with the U-net deep network. TU-net uses the encoder and decoder layers of U-net to perform multi-scale feature fusion on the noisy speech signal, and uses multi-head self-attention to implement a dual-path Transformer that computes the speech mask and better models long-term correlation. The model computes loss functions in the time domain, the time-frequency domain and the perceptual domain, and a weighted combination of these losses guides training. Simulation experiments show that, under low SNR and burst background noise, TU-net outperforms comparable monaural enhancement network models on several evaluation metrics, including Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI) and SNR gain, while keeping a relatively small number of model parameters.
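The weighted combination of time-domain, time-frequency-domain and perceptual-domain losses mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the form of each term or the weights, so everything below is an assumption made for illustration: mean absolute error on the waveform, mean absolute error on magnitude spectra from a naive unwindowed DFT, a hypothetical `perceptual_fn` callback standing in for the perceptual term, and arbitrary weights. It is not the authors' implementation.

```python
import cmath
import math


def frame_magnitudes(signal, frame_len=8, hop=4):
    """Magnitude spectra of overlapping frames (naive DFT, no window)."""
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        spectra.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n, x in enumerate(frame)))
            for k in range(frame_len // 2 + 1)
        ])
    return spectra


def time_domain_loss(estimate, reference):
    """Mean absolute error between waveforms (assumed form of this term)."""
    return sum(abs(e - r) for e, r in zip(estimate, reference)) / len(reference)


def time_frequency_loss(estimate, reference, frame_len=8, hop=4):
    """Mean absolute error between frame magnitude spectra (assumed form)."""
    est = frame_magnitudes(estimate, frame_len, hop)
    ref = frame_magnitudes(reference, frame_len, hop)
    total = sum(abs(a - b) for fe, fr in zip(est, ref) for a, b in zip(fe, fr))
    count = sum(len(fr) for fr in ref)
    return total / count


def combined_loss(estimate, reference, weights=(1.0, 1.0, 1.0), perceptual_fn=None):
    """Weighted sum of the three terms; the weights are illustrative only."""
    w_time, w_tf, w_perc = weights
    # perceptual_fn is a hypothetical placeholder; it contributes 0 if omitted.
    perceptual = perceptual_fn(estimate, reference) if perceptual_fn else 0.0
    return (w_time * time_domain_loss(estimate, reference)
            + w_tf * time_frequency_loss(estimate, reference)
            + w_perc * perceptual)


# Toy signals: a sine wave and a copy with a small alternating disturbance.
clean = [math.sin(2 * math.pi * n / 8) for n in range(32)]
noisy = [x + 0.1 * ((-1) ** n) for n, x in enumerate(clean)]
loss_value = combined_loss(noisy, clean, weights=(0.5, 0.3, 0.2))
```

The sketch assumes both signals have the same length and are at least one frame long. In a real training setup each term would be computed on tensors with automatic differentiation; the plain-Python version here only shows how the three terms are combined.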

Keywords: speech enhancement; long-term correlation; speech signal; SNR gain; attention mechanism; background noise; loss function; network model

Classification code: TN912.35 [Electronics and Telecommunications: Communication and Information Systems]

 
