Semi-supervised end-to-end fake speech detection method based on time-domain waveforms  Cited by: 2


Authors: FANG Xin[1,2]; HUANG Zexin; ZHANG Yuhan; GAO Tian; PAN Jia; FU Zhonghua; GAO Jianqing; LIU Junhua; ZOU Liang

Affiliations: [1] National Engineering Laboratory for Speech and Language Information Processing (University of Science and Technology of China), Hefei, Anhui 230027, China; [2] AI Institute, iFLYTEK Company Limited, Hefei, Anhui 230088, China; [3] School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China; [4] Xi'an iFLYTEK Hyper-brain Information Technology Company Limited, Xi'an, Shaanxi 710000, China

Source: Journal of Computer Applications (《计算机应用》), 2023, No. 1, pp. 227-231 (5 pages)

Funding: Science and Technology Innovation 2030 Major Project on "New Generation Artificial Intelligence" (2020AAA0103600).

Abstract: Fake speech produced by modern speech synthesis and timbre conversion systems poses a serious threat to automatic speaker recognition systems. Most existing fake speech detection systems perform well on attack types seen during training, but their performance degrades significantly on unknown attack types encountered in practical applications. Therefore, building on the recently proposed Dual-Path Res2Net (DP-Res2Net), a semi-supervised end-to-end fake speech detection method based on time-domain waveforms is proposed. First, semi-supervised learning is adopted for domain transfer to address the large difference in data distribution between the training set and the test set. Then, for feature engineering, time-domain sampling points are fed directly into DP-Res2Net, which adds local multi-scale information and makes full use of the dependencies between audio segments. Finally, the input features pass through a shallow convolution module, a feature fusion module and a global average pooling module to produce an embedding tensor, which is used to distinguish natural speech from fake speech. The proposed method was evaluated on the publicly available ASVspoof 2021 Speech Deep Fake evaluation set and on the VCC (Voice Conversion Challenge) dataset. Experimental results show that its Equal Error Rate (EER) is 19.97%, which is 10.8% lower than that of the best official baseline system, verifying that the semi-supervised end-to-end fake speech detection method based on time-domain waveforms is effective against unknown attacks and generalizes better.
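
The abstract reports its result as an Equal Error Rate (EER), the operating point at which the false acceptance rate (fake speech accepted as natural) equals the false rejection rate (natural speech rejected as fake). The sketch below is a minimal illustration of that metric only. It is not the official ASVspoof scoring tool and not the authors' code; the function name `compute_eer`, the convention that a higher score means "more likely natural", and the toy scores are all assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumed convention (not from the paper): higher score = more likely
    natural (bona fide) speech. Returns (eer, threshold).
    """
    # Candidate thresholds: every distinct score seen in either class.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: natural speech scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: fake speech scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, invented purely for illustration.
natural = [0.9, 0.8, 0.7, 0.3]
fake = [0.1, 0.2, 0.4, 0.6]
eer, threshold = compute_eer(natural, fake)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With finite score lists the two error rates rarely cross exactly, so this sketch reports the average of the two rates at the closest threshold. Scoring tools used in practice typically interpolate along the error curve instead.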

Keywords: fake speech detection; speech synthesis; timbre conversion; speaker recognition; time domain; semi-supervised learning

Classification code: TP391.5 [Automation and Computer Technology: Computer Application Technology]

 
