基于全卷积神经网络多任务学习的时域语音分离  

Time-domain Speech Separation Based on a Fully Convolutional Neural Network with Multitask Learning


Authors: SUN Linhui (孙林慧) [1]; WANG Chunyan (王春艳); ZHANG Meng (张蒙) (School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003, China)

Affiliation: [1] School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003, China

Source: Journal of Signal Processing (《信号处理》), 2024, No. 12, pp. 2228-2237 (10 pages)

Funding: National Natural Science Foundation of China (61901227).

Abstract: When speech separation is performed with a time-frequency mask estimated by a deep neural network, the phase spectrum of the mixed signal is commonly reused as the target signal phase, and no specialized processing is applied to the gender combination of the speakers, which degrades the quality of the separated speech. To address this problem, this study introduces a time-domain speech separation approach based on a fully convolutional neural network with gender combination detection (FCN-GCD) and multitask learning. The network consists of a speech separation module and a mixed-speech gender combination detection module. In the speech separation module, an FCN is constructed whose input is the time-domain mixed speech of two speakers and whose output is the clean speech of the target speaker; the FCN compresses features through the convolutional layers of the encoder and reconstructs them through the deconvolutional layers of the decoder, achieving end-to-end speech separation. In addition, through multitask learning, the GCD task for the mixed speech is integrated into the speech separation network. Under the joint constraint of the two tasks, auxiliary information features and speech separation features are obtained simultaneously, and these deep features are combined to enhance the model's ability to separate mixed speech of different gender combinations. Because the GCD task is incorporated as a secondary task, parameters are shared between the main and secondary tasks, which strengthens the separation capability of the primary task. Compared with frequency-domain methods, the proposed time-domain FCN-GCD method eliminates the need for phase recovery and frequency-to-time reconstruction, which simplifies processing and improves computational efficiency. Furthermore, it extracts effective auxiliary information features from the GCD task and uses the joint features to achieve more effective speech separation. Experimental results show that, compared with the single-task speech separation method, the proposed FCN-GCD method effectively improves speech quality for the male-male, female-female, and male-female gender combinations, and achieves better scores on the Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), Signal-to-Interference Ratio (SIR), Signal-to-Distortion Ratio (SDR), and Signal-to-Artifact Ratio (SAR) metrics.
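The abstract describes an encoder-decoder FCN that operates directly on the time-domain mixture, with a gender-combination detection branch whose output is fused back into the separation features and trained jointly with the separation task. The PyTorch sketch below only illustrates that overall structure: the layer sizes, kernel widths, three-class label encoding, fusion of raw logits, MSE reconstruction loss, and the 0.1 loss weight are all my own placeholder assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FCNGCDSketch(nn.Module):
    """Toy FCN separator with an auxiliary gender-combination head (illustrative only)."""

    def __init__(self, channels=(16, 32, 64), kernel=16, stride=8):
        super().__init__()
        # Convolutional encoder: compresses the time-domain mixture into deep features.
        layers, in_ch = [], 1
        for ch in channels:
            layers += [nn.Conv1d(in_ch, ch, kernel, stride=stride, padding=kernel // 2), nn.ReLU()]
            in_ch = ch
        self.encoder = nn.Sequential(*layers)

        # Auxiliary branch: 3-class gender-combination detection (M-M / F-F / M-F, assumed encoding).
        self.gcd_head = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(channels[-1], 3))

        # Feature fusion: concatenate (broadcast) auxiliary logits with the separation features.
        self.fuse = nn.Conv1d(channels[-1] + 3, channels[-1], kernel_size=1)

        # Deconvolutional decoder: reconstructs the target speaker's waveform.
        chs = list(reversed(channels)) + [1]
        layers = []
        for i in range(len(chs) - 1):
            layers.append(nn.ConvTranspose1d(chs[i], chs[i + 1], kernel, stride=stride, padding=kernel // 2))
            if i < len(chs) - 2:
                layers.append(nn.ReLU())
        self.decoder = nn.Sequential(*layers)

    def forward(self, mixture):                      # mixture: (batch, 1, samples)
        feats = self.encoder(mixture)                # separation features
        gcd_logits = self.gcd_head(feats)            # gender-combination logits
        aux = gcd_logits.unsqueeze(-1).expand(-1, -1, feats.shape[-1])
        fused = self.fuse(torch.cat([feats, aux], dim=1))
        return self.decoder(fused), gcd_logits       # estimated target waveform, aux logits


# One joint multitask training step on toy data (1 s of 16 kHz audio).
model = FCNGCDSketch()
mix = torch.randn(4, 1, 16000)                       # two-speaker mixtures
target = torch.randn(4, 1, 16000)                    # clean target speech
gcd_label = torch.randint(0, 3, (4,))                # gender-combination labels
est, logits = model(mix)
n = min(est.shape[-1], target.shape[-1])             # striding may shorten the output slightly
loss = F.mse_loss(est[..., :n], target[..., :n]) + 0.1 * F.cross_entropy(logits, gcd_label)
loss.backward()
```

The joint loss is what couples the two tasks: the shared encoder receives gradients from both the reconstruction term and the classification term, which is the parameter-sharing effect the abstract attributes to the secondary GCD task.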

Keywords: deep neural network; speech separation; fully convolutional neural network; feature fusion; multitask learning

CLC number: TN912.3 [Electronics and Telecommunications - Communication and Information Systems]
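Separation quality is reported in PESQ, STOI, SIR, SDR, and SAR. The sketch below shows one common way to compute these metrics with open-source Python packages (pesq, pystoi, mir_eval, soundfile); the paper does not state which evaluation toolchain it used, and the file names are hypothetical placeholders.

```python
import numpy as np
import soundfile as sf                               # pip install soundfile
from pesq import pesq                                # pip install pesq
from pystoi import stoi                              # pip install pystoi
from mir_eval.separation import bss_eval_sources     # pip install mir_eval

# Hypothetical file names: clean references and separated outputs for both speakers.
ref1, fs = sf.read("speaker1_clean.wav")             # PESQ 'wb' mode expects fs == 16000
ref2, _ = sf.read("speaker2_clean.wav")
est1, _ = sf.read("speaker1_separated.wav")
est2, _ = sf.read("speaker2_separated.wav")

# Trim to a common length and stack as (n_sources, n_samples) for BSS Eval.
n = min(map(len, (ref1, ref2, est1, est2)))
refs = np.stack([ref1[:n], ref2[:n]])
ests = np.stack([est1[:n], est2[:n]])

sdr, sir, sar, _ = bss_eval_sources(refs, ests)      # SDR / SIR / SAR per separated speaker

print("PESQ (speaker 1):", pesq(fs, ref1[:n], est1[:n], 'wb'))
print("STOI (speaker 1):", stoi(ref1[:n], est1[:n], fs, extended=False))
print("SDR:", sdr, "SIR:", sir, "SAR:", sar)
```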
