Authors: MO Shangbin (莫尚斌), WANG Wenjun (王文君), DONG Ling (董凌)[1,2,3], GAO Shengxiang (高盛祥), YU Zhengtao (余正涛)[1,2,3]
Affiliations: [1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming Yunnan 650500, China [2] Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming Yunnan 650500, China [3] Yunnan Provincial Key Laboratory of Media Integration, Kunming Yunnan 650228, China
Source: Journal of Computer Applications (《计算机应用》), 2024, No. 8, pp. 2611-2617 (7 pages)
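Funding: National Natural Science Foundation of China (61972186, U21B2027); Yunnan High-Tech Industry Development Project (201606); Yunnan Major Science and Technology Special Program (202103AA080015); Yunnan Basic Research Program (202001AS070014); Yunnan Science and Technology Talent and Platform Program (202105AC160018); Open Project of the Yunnan Provincial Key Laboratory of Media Integration (220225702).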
Abstract: To address insufficient acoustic feature extraction and severe loss of decoding features in single-channel speech enhancement networks based on the convolutional encoder-decoder architecture, a single-channel speech enhancement network called Multi-Channel Information Aggregation and Collaborative Decoding (MIACD) was proposed. A dual-channel encoder was used to extract speech magnitude spectrum and complex spectrum features enriched with Self-Supervised Learning (SSL) representations. A four-layer Conformer then modeled the extracted features along the time and frequency dimensions. Through residual connections, the magnitude and complex features extracted by the dual-channel encoder were fed into a three-channel information aggregation decoder. In addition, a Channel-Time-Frequency Attention (CTF-Attention) mechanism was proposed to adjust the aggregated information in the decoder according to the distribution of speech energy, effectively alleviating the severe loss of usable acoustic information during decoding. Experimental results on the public dataset Voice Bank DEMAND show that, compared with Glance and Gaze: a collaborative learning framework for single-channel speech enhancement (GaGNet), MIACD improves the objective metric Wide Band Perceptual Evaluation of Speech Quality (WB-PESQ) by 5.1% and reaches 96.7% on Short-Time Objective Intelligibility (STOI), verifying that the proposed method makes full use of speech information to reconstruct the signal, effectively suppressing noise and improving speech intelligibility.
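Keywords: acoustic feature; multi-channel information aggregation; dual-channel encoder; three-channel information aggregation decoder; Channel-Time-Frequency Attention (CTF-Attention) mechanism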
Classification number: TN912.35 [Electronics and Telecommunications: Communication and Information Systems]