Speech enhancement method based on two-branch attention and U-Net (cited by: 1)


Authors: Cao Jie [1,2]; Wang Chenzhang; Liang Haopeng; Wang Qiao; Li Xiaoxu (School of Computer & Communication, Lanzhou University of Technology, Lanzhou 730050, China; College of Information Engineering, Lanzhou City University, Lanzhou 730050, China)

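Affiliations: [1] School of Computer & Communication, Lanzhou University of Technology, Lanzhou 730050; [2] College of Information Engineering, Lanzhou City University, Lanzhou 730050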

Source: Application Research of Computers, 2024, No. 4, pp. 1112-1116 (5 pages)

Funding: Supported by the Key Research and Development Program of Gansu Province (22YF7GA130).

Abstract: Speech enhancement networks have difficulty extracting global speech-related features and capture the local contextual information of speech poorly. To address these problems, this paper proposes a time-domain speech enhancement method based on two-branch attention and U-Net. The method uses a U-Net encoder-decoder structure and takes as input the high-dimensional time-domain features obtained by applying one-dimensional convolution to single-channel noisy speech. First, a Conformer-based residual convolution is designed using residual connections to strengthen the network's noise-reduction ability. Second, a two-branch attention structure is designed that uses global and local attention to obtain richer contextual information from the noisy speech while effectively representing long-sequence features and extracting more diverse feature information. Finally, a weighted loss function combining time-domain and frequency-domain losses is constructed to train the network and improve its speech enhancement performance. Several metrics are used to evaluate the quality and intelligibility of the enhanced speech. On the public Voice Bank+DEMAND dataset, the enhanced speech achieves a perceptual evaluation of speech quality (PESQ) score of 3.11, a short-time objective intelligibility (STOI) of 95%, a composite measure for predicting signal rating (CSIG) of 4.44, a composite measure for predicting background noise (CBAK) of 3.60, and a composite measure for predicting overall processed speech quality (COVL) of 3.81. The PESQ is 7.6% higher than that of SE-Conformer and 5.1% higher than that of TSTNN. Experimental results show that the proposed method achieves better results across the evaluated speech denoising metrics and meets the requirements of speech enhancement tasks.
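The abstract mentions a weighted loss that combines a time-domain term with a frequency-domain term but does not give its exact form. The following is a minimal NumPy sketch of one common way such a loss can be built, not the authors' implementation. Everything specific here is an assumption made for illustration: the L1 distance for both terms, the Hann-windowed STFT magnitude as the frequency-domain representation, the weight `alpha`, and the helper names `stft_magnitude` and `weighted_loss`.

```python
import numpy as np


def stft_magnitude(x, frame_len=512, hop=128):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform.

    frame_len and hop are illustrative defaults, not values from the paper.
    The signal must contain at least frame_len samples.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # rfft keeps the non-negative frequency bins: frame_len // 2 + 1 per frame.
    return np.abs(np.fft.rfft(frames, axis=1))


def weighted_loss(enhanced, clean, alpha=0.5, frame_len=512, hop=128):
    """alpha * time-domain L1 + (1 - alpha) * spectral-magnitude L1.

    alpha is a hypothetical weighting hyperparameter (assumed, not from the paper).
    """
    enhanced = np.asarray(enhanced, dtype=np.float64)
    clean = np.asarray(clean, dtype=np.float64)
    # Time-domain term: mean absolute error between the two waveforms.
    time_term = np.mean(np.abs(enhanced - clean))
    # Frequency-domain term: mean absolute error between STFT magnitudes.
    freq_term = np.mean(
        np.abs(
            stft_magnitude(enhanced, frame_len, hop)
            - stft_magnitude(clean, frame_len, hop)
        )
    )
    return alpha * time_term + (1.0 - alpha) * freq_term


# Usage on synthetic data: a 1-second 440 Hz tone at 16 kHz plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

print(weighted_loss(clean, clean))  # identical signals give 0.0
print(weighted_loss(noisy, clean) > 0.0)
```

In this sketch, `alpha` trades waveform fidelity against spectral fidelity. In a training setup the same two terms would be written in an automatic-differentiation framework so that gradients can flow back to the network.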

Keywords: speech enhancement; two-branch attention mechanism; time domain; single channel

Classification number: TN912.35 [Electronics and Telecommunications: Communication and Information Systems]

 
