Conformer-Based Speaker Recognition Model for Real-Time Multi-Scenarios (基于Conformer的实时多场景说话人识别模型)  Cited by: 1


Authors: XUAN Xi (宣茜) [1]; HAN Runping (韩润萍) [1]; GAO Jingxin (高静欣) [2]

Affiliations: [1] School of Arts and Sciences, Beijing Institute of Fashion Technology, Beijing 100029, China; [2] School of Fashion, Beijing Institute of Fashion Technology, Beijing 100029, China

Source: Computer Engineering and Applications (《计算机工程与应用》), 2024, Issue 7, pp. 147-156 (10 pages)

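Funding: Science and Technology Program of the Beijing Municipal Education Commission (KM202210012002); Beijing Institute of Fashion Technology 2022 Graduate Research Innovation Project (X2022-110).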

Abstract: Speaker verification systems perform poorly in several scenarios: cross-domain speech, long-duration speech, and noisy speech. To address this, the paper proposes PMS-Conformer, a Conformer-based speaker recognition model that is robust across these scenarios and runs in real time. Its design is inspired by the state-of-the-art model MFA-Conformer. PMS-Conformer improves on MFA-Conformer's acoustic feature extractor, network components, and loss calculation module, yielding a novel and effective acoustic feature extractor and a robust speaker embedding extractor that generalizes well. PMS-Conformer is trained on the VoxCeleb1&2 datasets and compared with the baselines MFA-Conformer and ECAPA-TDNN on speaker verification tasks. The results show that on SITW (long-duration utterances), VoxMovies (cross-domain utterances), and a noise-augmented VoxCeleb-O test set, the speaker verification system built with PMS-Conformer is more competitive than the systems built with either baseline. In addition, the speaker embedding extractor of PMS-Conformer has significantly fewer trainable parameters (Params) and a significantly lower real-time factor (RTF) than that of ECAPA-TDNN. These results indicate that PMS-Conformer performs well in real-time, multi-scenario settings.
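The abstract reports inference speed as a real-time factor (RTF). As a reminder of what that metric means, here is a minimal Python sketch using the standard definition, processing time divided by audio duration. The abstract does not describe how the authors measured RTF, so the function names, the `extract_fn` callable, and the averaging over `repeats` runs are illustrative assumptions, not the paper's procedure.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """Standard RTF definition: processing time divided by audio duration.

    A value below 1.0 means the audio is processed faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(extract_fn, waveform, sample_rate, repeats=3):
    """Average RTF of a hypothetical embedding extractor over several runs.

    extract_fn is an assumed callable taking a sequence of samples; it
    stands in for whatever model is being timed.
    """
    audio_seconds = len(waveform) / sample_rate
    start = time.perf_counter()
    for _ in range(repeats):
        extract_fn(waveform)
    elapsed = (time.perf_counter() - start) / repeats
    return real_time_factor(elapsed, audio_seconds)
```

For example, an extractor that needs 0.5 s to process a 10 s utterance has an RTF of 0.05. Lower values are better, which is the sense in which the abstract calls PMS-Conformer's RTF lower than that of ECAPA-TDNN.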

Keywords: speaker verification; MFA-Conformer; Sub-center AAM-Softmax; speaker embedding; acoustic feature extraction

Classification code: TP391.4 [Automation and Computer Technology: Computer Application Technology]

 
