Real-scene synthetic speech detection based on cepstral feature data augmentation (基于倒谱特征数据增强的真实场景合成语音检测)

Authors: WAN Yi (万伊), LI Chunguo (李春国)[3], YANG Feiran (杨飞然)[1], YANG Jun (杨军)[1,2]

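Affiliations: [1] Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190; [2] University of Chinese Academy of Sciences, Beijing 100049; [3] School of Information Science and Engineering, Southeast University, Nanjing 210096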

Source: Chinese High Technology Letters (《高技术通讯》), 2024, No. 10, pp. 1013-1023 (11 pages)

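Funding: Supported by the National Natural Science Foundation of China (62171438), the Beijing Natural Science Foundation (4242013), and the "Frontier Exploration" self-deployed project of the Institute of Acoustics, Chinese Academy of Sciences (QYTS202111).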

Abstract: Existing synthetic speech detection systems degrade significantly in real-world scenarios. This paper proposes a data augmentation method for cepstral features based on frequency masking. First, the linear filter bank features (LFBs) of the input signal are masked along the frequency channels to introduce speech distortion consistent with real scenes. Then, the linear frequency cepstral coefficients (LFCC) of the masked features are computed to reduce the feature dimensionality and improve detection performance. Three detection systems, built on a light convolutional neural network (LCNN), a deep residual network (ResNet), and a one-dimensional convolutional Transformer (OCT), are used to verify the effectiveness of the proposed method. Experiments on real-scene datasets show that the proposed method reduces the equal error rate (EER) of the different synthetic speech detection systems by 6.39%-25.95% compared with the baseline without augmentation. Combined with codec-based augmentation, the proposed method reduces the EER of the different systems by 31.71%-42.47% compared with the baseline, which further improves the generalization of the systems to real scenes and outperforms existing data augmentation methods.
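The two steps named in the abstract (masking the filter-bank features along frequency, then taking cepstral coefficients) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the function names, zero as the mask fill value, a single random band per utterance, the mask width limit, and the unnormalized DCT-II used to obtain the cepstral coefficients. The input is assumed to be log filter-bank energies, one list per frame.

```python
import math
import random


def frequency_mask(lfb, max_width, rng):
    """Zero one random band of consecutive filter-bank channels in every frame.

    lfb: list of frames, each a list of log filter-bank energies (assumed input).
    Zero fill and a single band per utterance are illustrative assumptions.
    """
    n_channels = len(lfb[0])
    width = rng.randint(0, max_width)            # mask width, bounds inclusive
    start = rng.randint(0, n_channels - width)   # index of first masked channel
    masked = [
        [0.0 if start <= k < start + width else v for k, v in enumerate(frame)]
        for frame in lfb
    ]
    return masked, (start, width)


def dct_ii(frame, n_coeffs):
    """Unnormalized DCT-II of one frame, keeping the first n_coeffs terms."""
    n = len(frame)
    return [
        sum(v * math.cos(math.pi * c * (k + 0.5) / n) for k, v in enumerate(frame))
        for c in range(n_coeffs)
    ]


def masked_lfcc(lfb, max_width=4, n_coeffs=8, seed=0):
    """Mask the filter-bank features, then reduce each frame to n_coeffs cepstra."""
    rng = random.Random(seed)
    masked, band = frequency_mask(lfb, max_width, rng)
    return [dct_ii(frame, n_coeffs) for frame in masked], band
```

Because the mask is applied before the DCT, the distortion of a few channels spreads across all cepstral coefficients, while the output dimension drops from the number of filter-bank channels to `n_coeffs`.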

Keywords: synthetic speech detection; data augmentation; real scene; frequency masking; generalization ability

Classification code: TN912.3 [Electronics and Telecommunications: Communication and Information Systems]

