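Affiliation: [1] School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, Gansu, China
Source: Journal of Signal Processing (《信号处理》), 2025, No. 4, pp. 718-729 (12 pages)
Funding: Lanzhou Jiaotong University and Counterpart-Support Universities program (LH2023002); Lanzhou Jiaotong University Youth Fund Project (LH2019005); Inner Mongolia Key R&D and Achievement Transformation Project (2023YFSH0043, 2023YFDZ0043); Gansu Province Key Talent Project.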
Abstract: This study proposes a model for unknown-speaker speech separation in complex environments containing noise and reverberation, based on multi-scale deformable attention encoding and multi-path fusion. Existing models for unknown-speaker separation are typically evaluated under clean experimental conditions, which do not reflect the complex acoustic backgrounds of real-world settings. To let the model cope with the variability and non-stationarity of mixed speech signals in practical applications, a multi-scale deformable attention mechanism is combined with a Transformer encoder to form the Transformer Encoder Multi-Scale Deformable Attention (TEMDA) module. The offset layers of the multi-scale deformable attention mechanism compute dynamically at different positions, which expands the receptive field of the model and lets it focus more effectively on important time points, reducing the impact of noise and reverberation. To capture contextual information better, the multi-path fusion strategy adds an inter-channel Conformer to the dual-path module, forming a three-path module that extracts feature information among multiple speakers. This design fuses single-speaker and multi-speaker information more effectively and improves separation performance. Experiments show that the proposed model achieves significant separation performance on both the clean and noisy versions of the Libri2Mix and Libri3Mix datasets. On the LRS2-2Mix dataset, the model better reduces the impact of noise and reverberation on separation, with a Scale-Invariant Signal-to-Noise Ratio Improvement (SI-SNRi) of 14.7 dB and a Signal-to-Distortion Ratio Improvement (SDRi) of 15.1 dB. The speaker-count estimation accuracy across three speaker counts is 98.89%, an improvement of 0.12%. These findings indicate that the proposed model is well
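Keywords: unknown-speaker speech separation; multi-scale deformable attention encoding strategy; multi-path fusion; attractor estimation
Classification No.: TN912.3 [Electronics and Telecommunications: Communication and Information Systems]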
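The abstract reports results as SI-SNRi, the gain in scale-invariant signal-to-noise ratio of the separated signal over the unprocessed mixture. The following is a minimal sketch of the commonly used definition of that metric, not the authors' evaluation code; the function names `si_snr` and `si_snr_improvement`, the `eps` stabilizer, and the synthetic signals in the usage lines are all assumptions made for illustration.

```python
import math
import random


def _zero_mean(x):
    """Return a copy of x with its mean removed."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def _dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB (hypothetical helper, standard definition).

    The estimate is projected onto the target; the projection counts as
    signal and the residual as noise, so rescaling the estimate does not
    change the score.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    est = _zero_mean(estimate)
    ref = _zero_mean(target)
    scale = _dot(est, ref) / (_dot(ref, ref) + eps)
    s_target = [scale * r for r in ref]
    e_noise = [e - s for e, s in zip(est, s_target)]
    ratio = (_dot(s_target, s_target) + eps) / (_dot(e_noise, e_noise) + eps)
    return 10.0 * math.log10(ratio)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Usage with synthetic data (assumed for illustration only).
rng = random.Random(0)
target = [rng.gauss(0.0, 1.0) for _ in range(4000)]
interferer = [rng.gauss(0.0, 1.0) for _ in range(4000)]
mixture = [t + i for t, i in zip(target, interferer)]
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]
print(f"SI-SNRi: {si_snr_improvement(estimate, mixture, target):.2f} dB")
```

Because the score is measured relative to the mixture, a reported value such as 14.7 dB describes how much the separation improved the signal, not the absolute quality of the output.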