Single-Channel Speech Enhancement Based on Multi-Channel Information Aggregation and Collaborative Decoding

Authors: MO Shangbin; WANG Wenjun; DONG Ling [1,2,3]; GAO Shengxiang; YU Zhengtao [1,2,3]

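Affiliations: [1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; [2] Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500, China; [3] Yunnan Provincial Key Laboratory of Media Integration, Kunming, Yunnan 650228, China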

Source: Journal of Computer Applications (《计算机应用》), 2024, No. 8, pp. 2611-2617 (7 pages)

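Funding: National Natural Science Foundation of China (61972186, U21B2027); Yunnan High-Tech Industry Development Project (201606); Yunnan Provincial Major Science and Technology Special Program (202103AA080015); Yunnan Provincial Basic Research Program (202001AS070014); Yunnan Provincial Science and Technology Talent and Platform Program (202105AC160018); Open Project of the Yunnan Provincial Key Laboratory of Media Integration (220225702).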

Abstract: Single-channel speech enhancement networks based on a convolutional encoder-decoder architecture extract acoustic features insufficiently and lose much of the feature information during decoding. To address these problems, a single-channel speech enhancement network called Multi-Channel Information Aggregation and Collaborative Decoding (MIACD) was proposed. A dual-channel encoder extracts magnitude spectrum and complex spectrum features that are enriched with speech Self-Supervised Learning (SSL) representations. Four Conformer layers then model the extracted features along the time and frequency dimensions. Residual connections carry the magnitude and complex features extracted by the dual-channel encoder into a three-channel information aggregation decoder, and the proposed Channel-Time-Frequency Attention (CTF-Attention) mechanism adjusts the aggregated information in the decoder according to the distribution of speech energy, which alleviates the severe loss of usable acoustic information during decoding. Experimental results on the public Voice Bank DEMAND dataset show that, compared with GaGNet (Glance and Gaze, a collaborative learning framework for single-channel speech enhancement), MIACD improves the objective metric Wide Band Perceptual Evaluation of Speech Quality (WB-PESQ) by 5.1% and reaches 96.7% on Short-Time Objective Intelligibility (STOI). These results indicate that the proposed method makes fuller use of speech information when reconstructing the signal, suppresses noise effectively, and improves speech intelligibility.
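The abstract says that CTF-Attention adjusts decoder features according to the distribution of speech energy, but it does not describe the mechanism itself. The sketch below is therefore only a generic illustration of gating a feature map along the channel, time, and frequency axes. It is not the authors' design: the function name `ctf_gate`, the parameter-free pooling, and the toy tensor are all assumptions made for this example, and a trained module would normally pass the pooled statistics through learned layers.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, mapping real values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def ctf_gate(features):
    """Rescale a (channels, time, freq) feature map along each axis.

    Hypothetical, parameter-free stand-in for an attention module:
    the weights come directly from pooled energy, not learned layers.
    """
    # Element-wise energy; pooling it summarizes where the energy sits.
    energy = features ** 2
    # One descriptor per channel, per time frame, and per frequency bin.
    chan = energy.mean(axis=(1, 2), keepdims=True)  # shape (C, 1, 1)
    time = energy.mean(axis=(0, 2), keepdims=True)  # shape (1, T, 1)
    freq = energy.mean(axis=(0, 1), keepdims=True)  # shape (1, 1, F)
    # Center each descriptor so above-average energy gets a weight above 0.5.
    weights = [sigmoid(d - d.mean()) for d in (chan, time, freq)]
    # Broadcasting applies the three gates jointly to every element.
    return features * weights[0] * weights[1] * weights[2]


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10, 8))  # toy tensor: 4 channels, 10 frames, 8 bins
x[:, :, :2] *= 5.0                   # make the two lowest bins high-energy
y = ctf_gate(x)
```

In this toy setup the two high-energy frequency bins are attenuated less than the remaining bins, which is the qualitative behavior the abstract attributes to energy-aware attention.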

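Keywords: acoustic features; multi-channel information aggregation; dual-channel encoder; three-channel information aggregation decoder; channel-time-frequency attention mechanism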

Classification code: TN912.35 [Electronics and Telecommunications: Communication and Information Systems]
