Research on Synthetic Speech Detection Based on Involution Operator and Cross Attention Mechanism


Authors: DENG Sibo; LU Tianliang[1]; PENG Shufan; LIU Xiaowen; YU Zijian

Affiliation: [1] School of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China

Source: Journal of People's Public Security University of China (Science and Technology), 2023, No. 3, pp. 65-72 (8 pages)

Funding: Major Project of the National Social Science Fund of China (21&ZD193).

Abstract: With the rapid development of science and technology, synthetic speech generated by deep learning poses new challenges to voice authentication systems and cyberspace security. To address the low accuracy of existing detection models and their insufficient mining of speech features, this paper proposes a synthetic speech detection method improved with the Involution operator and a cross-attention mechanism. The front end extracts linear frequency cepstral coefficient (LFCC) features and constant Q transform (CQT) spectrogram features from the speech data, and feeds each feature into its own branch of a dual-branch back-end network. The back end uses ResNet18 as the backbone for shallow feature learning and embeds the Involution operator into the backbone, which enlarges the region of the feature map that is learned from and enriches the spectrogram feature information captured across the spatial extent. A cross-attention mechanism is introduced after the two branches so that the LFCC and CQT features build interactive global information, strengthening the model's deep mining of features. On the ASVspoof 2019 LA evaluation set, the proposed model achieves an equal error rate (EER) of 0.84% and a minimum normalized tandem detection cost function (min-tDCF) of 0.026, outperforming mainstream detection models. The results show that the improved model effectively fuses different spectral features and improves feature learning, which in turn strengthens detection.
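
The cross-attention step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the layer configuration, so everything concrete here is an assumption for illustration only: single-head scaled dot-product attention, toy feature sizes, random projection matrices, and the hypothetical names `cross_attention`, `lfcc_feats` and `cqt_feats`. It is not the authors' implementation; it only shows how each branch can query the other so that the two feature streams exchange global information.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative assumption).

    Queries come from one branch, keys and values from the other branch.
    Returns the attended features and the attention weights.
    """
    q = matmul(query_feats, w_q)
    k = matmul(context_feats, w_k)
    v = matmul(context_feats, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


def rand_matrix(rng, rows, cols):
    """Random matrix standing in for learned features or projections."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
feat_dim, attn_dim = 3, 2                    # toy sizes, not from the paper
lfcc_feats = rand_matrix(rng, 4, feat_dim)   # 4 positions from the LFCC branch
cqt_feats = rand_matrix(rng, 5, feat_dim)    # 5 positions from the CQT branch
w_q = rand_matrix(rng, feat_dim, attn_dim)
w_k = rand_matrix(rng, feat_dim, attn_dim)
w_v = rand_matrix(rng, feat_dim, attn_dim)

# Each branch attends to the other, giving two cross-informed feature sets.
lfcc_to_cqt, w_lc = cross_attention(lfcc_feats, cqt_feats, w_q, w_k, w_v)
cqt_to_lfcc, w_cl = cross_attention(cqt_feats, lfcc_feats, w_q, w_k, w_v)
```

In this sketch the output keeps the number of positions of the querying branch, and every row of the attention weights sums to 1 over the positions of the other branch. A real model would typically use separately learned projections per direction and fuse the two outputs before classification; the abstract does not say how the paper does this.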

Keywords: synthetic speech detection; feature fusion; Involution operator; attention mechanism

Classification code: D918.2 [Politics and Law: Law]

 
