Survey of Adversarial Attacks and Defenses for Large Language Models

Authors: Tai Jianwei, Yang Shuangning, Wang Jiajia, Li Yakai, Liu Qixu[2], Jia Xiaoqi[2]

Affiliations: [1] School of Internet, Anhui University, Hefei 230039; [2] Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2025, No. 3, pp. 563-588 (26 pages)

Funding: National Natural Science Foundation of China, General Program (71971002); Natural Science Foundation of Anhui Province (2108085QA35).

Abstract: With the rapid development of natural language processing and deep learning, large language models (LLMs) are increasingly applied in fields such as text processing, language understanding, image generation, and code auditing, and have become a research focus for both academia and industry. However, attackers can use adversarial attacks to steer LLMs into generating erroneous, unethical, or false content, so the security threats facing these models and their applications are increasingly severe. This paper reviews recent adversarial attack methods and defense strategies for LLMs, summarizing the fundamental principles, implementation techniques, and main findings of the relevant studies. On this basis, it examines four mainstream attack modes in technical depth: prompt injection attacks, indirect prompt injection attacks, jailbreak attacks, and backdoor attacks. Each is analyzed in terms of its mechanisms, impacts, and potential risks. The paper further discusses the current state and future directions of LLM security research, and considers the application prospects of LLMs combined with technologies such as multimodal data analysis and integration. This review aims to enhance understanding of the field and foster more secure, reliable applications of large language models.

Keywords: large language models; adversarial attacks; defense strategies; cyberspace security; generative artificial intelligence

Classification number: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]

 
