Authors: Deng Fei[1], Deng Lihong, Hu Wenyi[1], Zhang Gexiang, Yang Qiang (School of Computer & Network Security (Oxford Brookes College), Chengdu University of Technology, Chengdu 610059, China; Artificial Intelligence Research Center, Chengdu University of Technology, Chengdu 610059, China; School of Control Engineering, Chengdu University of Information Technology, Chengdu 610059, China)
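Affiliations: [1] School of Computer & Network Security (Oxford Brookes College), Chengdu University of Technology, Chengdu 610059; [2] Artificial Intelligence Research Center, Chengdu University of Technology, Chengdu 610059; [3] School of Control Engineering, Chengdu University of Information Technology, Chengdu 610059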
Source: Application Research of Computers (《计算机应用研究》), 2022, No. 3, pp. 721-725 (5 pages)
Funding: National Natural Science Foundation of China (61972324); Sichuan Science and Technology Program (2021YFS0313, 2021YFG0133).
Abstract: Speaker identification is an important biometric technology. A variety of model architectures based on deep convolutional neural networks (DNN) have shown increasingly strong feature representation capabilities and have been built into unified end-to-end speaker recognition systems that outperform traditional recognition models. In such systems, the utterance-level feature produced by the aggregation model is one of the key factors affecting recognition accuracy. Most current approaches use the self-attention pooling (SAP) aggregation model. However, SAP often fails to select frames accurately, so the aggregated utterance-level features are inaccurate and not robust. This paper improves the aggregation scheme of SAP by introducing a mean vector method, yielding an improved aggregation model, mSAP. mSAP aggregates variable-length input sequences into utterance-level features in a more fine-grained and more stable way, and captures long-term variations in the input sequence more effectively. Experiments show that the equal error rate (EER) of mSAP is lower than that of the TAP, SAP, and NetVLAD aggregation models by 7.4, 1.75, and 0.24, respectively, and its DCF value is lower than those of the three models by 0.018, 0.137, and 0.242, respectively. The improved mSAP aggregation model aggregates more robust and more accurate utterance-level features and effectively improves the performance of end-to-end speaker recognition models.
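For readers unfamiliar with the baselines named in the abstract, the following is a minimal NumPy sketch of temporal average pooling (TAP) and of self-attention pooling (SAP) in its commonly used form: a tanh projection of each frame, scored against a learnable context vector and normalized with softmax. It is an illustration under stated assumptions, not the authors' code. The function names, the feature dimension, and the random (untrained) parameters `W`, `b`, and `mu` are all hypothetical. The abstract does not specify how mSAP's mean vector enters the computation, so mSAP itself is not sketched.

```python
import numpy as np

def temporal_average_pooling(frames):
    """TAP: average frame-level features of shape (T, D) into one (D,) vector."""
    return frames.mean(axis=0)

def self_attention_pooling(frames, W, b, mu):
    """SAP (common formulation): attention-weighted mean of frame features.

    frames: (T, D) frame-level features for one utterance
    W, b:   (D, D) and (D,) projection parameters (assumed, normally learned)
    mu:     (D,) context vector (assumed, normally learned)
    Returns the (D,) utterance-level vector and the (T,) attention weights.
    """
    hidden = np.tanh(frames @ W + b)      # (T, D) projected frames
    scores = hidden @ mu                  # (T,) one relevance score per frame
    scores = scores - scores.max()        # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over frames
    return weights @ frames, weights

# Hypothetical setup: random parameters stand in for trained ones.
rng = np.random.default_rng(0)
D = 8
W = rng.normal(size=(D, D)) * 0.1
b = np.zeros(D)
mu = rng.normal(size=D)

# Utterances of different lengths map to vectors of the same size.
short_utt = rng.normal(size=(50, D))
long_utt = rng.normal(size=(200, D))
tap_vec = temporal_average_pooling(short_utt)
sap_short, w_short = self_attention_pooling(short_utt, W, b, mu)
sap_long, w_long = self_attention_pooling(long_utt, W, b, mu)
print(tap_vec.shape, sap_short.shape, sap_long.shape)
```

With a zero context vector every frame receives the same weight, so SAP reduces to TAP. This makes the role of frame selection explicit: the quality of the SAP output depends entirely on how well the attention weights pick out informative frames, which is the weakness the abstract attributes to SAP.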
Classification: TP391.4 [Automation and Computer Technology: Computer Application Technology]