Authors: XU Mingyang (许铭洋); WANG Huapeng (王华朋)[1]; YAN Daoshen (闫道申); YANG Haitao (杨海涛)[1]; CHU Xianteng (楚宪腾) (College of Public Security Information Technology and Intelligence, Criminal Investigation Police University of China, Shenyang 110854, China)
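Affiliation: [1] College of Public Security Information Technology and Intelligence, Criminal Investigation Police University of China, Shenyang 110854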
Source: Forensic Science and Technology (《刑事技术》), 2023, No. 5, pp. 466-472 (7 pages)
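Funding: National Key Research and Development Program of China (2017YFC0821000); Ministry of Justice Key Laboratory of Forensic Science project (KF202117); Guangzhou Science and Technology Plan project (2019030004).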
Abstract: Many speech technologies require that a complete utterance contain only one speaker; otherwise the performance of the algorithm degrades. Speaker diarization is therefore an important front end for such systems when multiple speakers are present. To improve the accuracy of segmenting mixed multi-speaker speech, this study proposes a d-vector based speaker diarization method in which the embedding extractor is trained with the generalized end-to-end (GE2E) loss function, which achieves better performance on speaker verification tasks. First, the GE2E loss is used to train a deep neural network (DNN) based on long short-term memory (LSTM) that extracts speaker-discriminative embeddings (d-vectors), that is, neural-network-based audio embeddings. Second, a reference speech segment for each speaker is cut from the input audio file and its embedding is obtained separately. Continuous embeddings are then obtained over the entire audio file, and the cosine similarity between each continuous embedding and each speaker's embedding is computed. Finally, segments with a cosine similarity score greater than 0.75 are stored in the audio file of the corresponding speaker. The method follows a recognize-first, segment-second principle. Experimental results show that it performs well in scenarios where the number of speakers is known in advance and a short speech sample of each individual speaker is easy to obtain. It can therefore be used in multi-speaker automatic recognition systems to segment the target speaker's speech automatically and improve working efficiency.
Keywords: speaker diarization; long short-term memory (LSTM); generalized end-to-end (GE2E); audio embedding; cosine similarity
Classification code: TN912.3 [Electronics and Telecommunications: Communication and Information Systems]
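
The final step described in the abstract, comparing each continuous embedding with each speaker's reference embedding and keeping the segments that score above 0.75, can be illustrated roughly as follows. This is a minimal sketch and not the authors' implementation: the function names, the toy three-dimensional vectors, and the choice to test every speaker independently against the threshold are assumptions made for illustration. In the paper the embeddings come from an LSTM network trained with the GE2E loss, which is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_windows(window_embeddings, speaker_embeddings, threshold=0.75):
    """Map each speaker name to the indices of windows scoring above the threshold.

    The 0.75 default is the threshold quoted in the abstract; everything else
    about this function is a hypothetical illustration.
    """
    assignments = {name: [] for name in speaker_embeddings}
    for index, window in enumerate(window_embeddings):
        for name, reference in speaker_embeddings.items():
            if cosine_similarity(window, reference) > threshold:
                assignments[name].append(index)
    return assignments


# Toy stand-ins for d-vectors (real embeddings would be much higher-dimensional).
speakers = {
    "speaker_a": [1.0, 0.0, 0.0],
    "speaker_b": [0.0, 1.0, 0.0],
}
windows = [
    [0.9, 0.1, 0.0],    # close to speaker_a
    [0.1, 0.95, 0.0],   # close to speaker_b
    [0.5, 0.5, 0.5],    # ambiguous, below the threshold for both
    [0.0, 0.0, 1.0],    # matches neither reference
]

result = assign_windows(windows, speakers)
print(result)
```

Windows that do not exceed the threshold for any speaker are simply left unassigned in this sketch; how the original system handles such segments is not stated in the abstract.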