Research progress on speech deepfake and its detection techniques  Cited by: 2


Authors: Xu Yuxiong; Li Bin; Tan Shunquan [1,2,4]; Huang Jiwu

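Affiliations: [1] Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen 518060, China; [2] Shenzhen Key Laboratory of Media Security, Shenzhen 518060, China; [3] College of Electronics and Information Engineering, Shenzhen University, Shenzhen 518060, China; [4] College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China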

Source: Journal of Image and Graphics (《中国图象图形学报》), 2024, No. 8, pp. 2236-2268 (33 pages)

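Funding: National Natural Science Foundation of China (U23B2022, U22B2047, 62272314); Guangdong Basic and Applied Basic Research Foundation (2019B151502001); Shenzhen Basic Research Key Project (JCYJ20200109105008228); Amazon Web Services, 2022 Ministry of Education Employment and Education Project (20221128).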

Abstract: Speech deepfake technology uses deep learning methods to synthesize or generate speech. The rapid iteration and optimization of artificial intelligence-generated content (AIGC) technologies have markedly improved the naturalness, fidelity, and diversity of forged speech, and have at the same time created great challenges for speech deepfake detection. This paper comprehensively reviews research progress on speech deepfake and its detection techniques. First, it introduces the forgery techniques represented by speech synthesis (SS) and voice conversion (VC). It then introduces the datasets and evaluation metrics commonly used in speech deepfake detection. On this basis, it categorizes and analyzes existing detection techniques along the processing pipeline: data augmentation, feature extraction and optimization, and learning mechanisms. Specifically, from the data augmentation perspective, it analyzes how noise addition, masking augmentation, channel augmentation, and compression augmentation affect detection performance. From the feature extraction and optimization perspective, it compares the strengths and weaknesses of detection based on handcrafted features, hybrid features, end-to-end models, and feature fusion. From the learning mechanism perspective, it discusses training approaches such as self-supervised learning, adversarial training, and multi-task learning. Finally, it summarizes the open challenges in speech deepfake detection and outlines directions for future research. The datasets and code collected in this paper are available at https://github.com/media-sec-lab/Audio-Deepfake-Detection.

Speech deepfake technology, which employs deep learning methods to synthesize or generate speech, has emerged as a critical research hotspot in multimedia information security. The rapid iteration and optimization of artificial intelligence-generated content technologies have significantly advanced speech deepfake techniques, enhancing the naturalness, fidelity, and diversity of synthesized speech. However, they have also presented great challenges for speech deepfake detection technology. To address these challenges, this study comprehensively reviews recent research progress on speech deepfake generation and its detection techniques. Based on an extensive literature survey, this study first introduces the research background of speech forgery and its detection, and compares and analyzes previously published reviews in this field. Second, this study provides a concise overview of speech deepfake generation, especially speech synthesis (SS) and voice conversion (VC). SS, which is commonly known as text-to-speech (TTS), analyzes text and generates speech that aligns with the provided input by applying linguistic rules for text description. Various deep models are employed in TTS, including sequence-to-sequence models, flow models, generative adversarial network models, variational auto-encoder models, and diffusion models. VC involves modifying acoustic features, such as emotion, accent, pronunciation, and speaker identity, to produce speech that resembles human speech. VC algorithms can be categorized as single, multiple, and arbitrary target speech conversion depending on the number of target speakers. Third, this study briefly introduces commonly used datasets in speech deepfake detection and provides relevant access links to open-source datasets. It also briefly introduces two commonly used evaluation metrics in speech deepfake detection: the equal error rate and the tandem detection cost function. This study analyzes and categorizes the existing deep speech forgery detection techniques i
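The abstract names the equal error rate (EER) as one of the two common evaluation metrics. As an illustration only, and not taken from the paper or its repository, the sketch below approximates the EER by sweeping a decision threshold over the observed scores. The function name `equal_error_rate` and the convention that a higher score means "more likely bona fide" are assumptions made for this example.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumption for this sketch: a higher score means "more likely bona fide",
    and an utterance is accepted when score >= threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus one that rejects all.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, best_eer = None, None
    for t in thresholds:
        # False acceptance rate: spoofed utterances wrongly accepted.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide utterances wrongly rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    # Toy scores, invented for illustration.
    print(equal_error_rate([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.7, 0.2]))
```

With finitely many scores the two error rates rarely cross exactly, so this sketch reports the midpoint at the closest operating point; evaluation toolkits may interpolate differently.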

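Keywords: speech deepfake; speech deepfake detection; speech synthesis (SS); voice conversion (VC); artificial intelligence-generated content (AIGC); self-supervised learning; adversarial training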

Classification number: TN912 [Electronics and Telecommunications: Communication and Information Systems]

 
