Speech Separation Method Based on Visual Modal Scale Fusion


Authors: 朱亚峰 (Zhu Yafeng), 贾林锋 (Jia Linfeng), 张炜 (Zhang Wei)

Affiliation: [1] School of Intelligent Manufacturing and Electrical Engineering, 广州理工学院 (Guangzhou Institute of Science and Technology), Guangzhou, Guangdong, China

Source: Instrumentation and Equipments (《仪器与设备》), 2024, No. 3, pp. 315-328 (14 pages)

Abstract: Multimodal speech separation methods fuse visual and auditory information to improve on the separation performance of audio-only models. Existing audiovisual fusion mechanisms, however, have paid insufficient attention to the scale mismatch between modal features, which limits the expression of high-dimensional visual semantic information and degrades separation performance. This paper therefore proposes a fusion method based on the visual modal scale: an encoder reduces the auditory temporal scale and reconstructs speech features that incorporate visual modal information. Building on a mainstream separation baseline model, a temporal convolution block with dual-scale dilated convolution fusion is introduced to learn multi-dimensional feature information and further improve separation performance. The proposed multimodal speech separation method is evaluated on the GRID and TUT2016 datasets. Experimental results show improvements of 2.14 dB over the single-modality baseline model and 0.82 dB over a comparable audiovisual speech separation model, verifying the effectiveness of the proposed method. Finally, drawing on interpretability analysis theory, the backbone network's influence on separation performance is visualized, providing a theoretical basis for subsequent structural design and for the interpretability of speech separation.
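To make the dual-scale dilated convolution idea concrete, the following is a minimal pure-Python sketch: the same kernel is applied at a small and a large dilation rate over a 1-D feature sequence, and the two receptive-field scales are fused. The kernel values, dilation rates, and fusion-by-averaging are illustrative assumptions, not the paper's actual implementation.

```python
def dilated_conv1d(x, kernel, dilation):
    """Valid 1-D convolution of sequence x with the given kernel and dilation.

    With dilation d and kernel length k, each output position sees a
    receptive field spanning (k - 1) * d + 1 input samples.
    """
    k = len(kernel)
    span = (k - 1) * dilation
    return [
        sum(kernel[i] * x[t + i * dilation] for i in range(k))
        for t in range(len(x) - span)
    ]

def dual_scale_block(x, kernel, d_small=1, d_large=4):
    """Fuse a small-dilation (local) and large-dilation (long-range) branch.

    Both branches share one kernel here for simplicity; fusion is an
    elementwise average over the overlapping output span.
    """
    local_branch = dilated_conv1d(x, kernel, d_small)
    global_branch = dilated_conv1d(x, kernel, d_large)
    n = min(len(local_branch), len(global_branch))
    return [(local_branch[i] + global_branch[i]) / 2 for i in range(n)]
```

In a real temporal convolution network each branch would have its own learned weights and the fused output would pass through normalization and a residual connection; the sketch only illustrates how the two dilation scales combine local and long-range temporal context.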

Keywords: multimodal speech separation; audiovisual fusion; temporal convolution; temporal scale

CLC number: TP3 [Automation and Computer Technology - Computer Science and Technology]

 
