Binary Code Similarity Detection Method Based on Pre-training Assembly Instruction Representation

Cited by: 3


Authors: WANG Taiyan; PAN Zulie; YU Lu; SONG Jingbin (College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China; Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China; PLA 31401, Changchun 130022, China)

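Affiliations: [1] College of Electronic Engineering, National University of Defense Technology, Hefei 230037; [2] Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037; [3] PLA 31401, Changchun 130022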

Source: Computer Science (《计算机科学》), 2023, Issue 4, pp. 288-297 (10 pages)

Funding: National Key Research and Development Program of China (2021YFB3100500).

Abstract: Binary code similarity detection has been widely used in recent years for vulnerable-function search, malware detection, and advanced program analysis. Because program code resembles natural language to some degree, researchers have begun to apply natural language processing techniques such as pre-training to improve detection accuracy. Existing methods, however, do not consider the probability features of program instructions, which creates a bottleneck in accuracy gains. To address this, a binary code similarity detection method based on pre-trained assembly instruction representation is proposed. A tokenization method for multi-architecture assembly instructions is designed, and pre-training tasks are built on control-flow and data-flow relations while also taking into account the probability that instructions appear in sequence and the usage frequency of each instruction unit, so that instructions receive a better vectorized representation. The downstream binary code similarity detection task is then improved by using these representation vectors, in place of statistical features, to represent instructions and basic blocks, which raises detection accuracy. Experimental results show that, compared with existing methods, the proposed method improves instruction representation capability by up to 23.7%, improves basic block search accuracy by up to 33.97%, and increases the number of detections in binary code similarity detection by up to 400%.
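
As a rough illustration of the pipeline the abstract outlines (tokenize assembly instructions, map tokens to vectors, compare basic blocks by vector similarity), the following Python sketch may help. It is not the authors' implementation: the tokenization rules, the `<imm>` placeholder for immediates, the mean-pooling of token vectors, and the hash-based `toy_embedding` are all assumptions made for illustration. In the described method, the vectors would come from a pre-trained model rather than a hash.

```python
import hashlib
import math
import re

DIM = 8  # hypothetical embedding width, chosen only for this sketch


def tokenize_instruction(instruction: str) -> list[str]:
    """Split one assembly instruction into an opcode plus operand tokens.

    Assumption: immediates are replaced by "<imm>" so that blocks differing
    only in constants map to the same token sequence.
    """
    instruction = instruction.strip().lower()
    if not instruction:
        return []
    opcode, _, operand_text = instruction.partition(" ")
    tokens = [opcode]
    pattern = r"0x[0-9a-f]+|\d+|[a-z_][a-z0-9_.]*|[\[\]+\-*:#]"
    for operand in operand_text.split(","):
        for piece in re.findall(pattern, operand):
            if re.fullmatch(r"0x[0-9a-f]+|\d+", piece):
                tokens.append("<imm>")
            else:
                tokens.append(piece)
    return tokens


def toy_embedding(token: str) -> list[float]:
    """Deterministic placeholder vector standing in for pre-trained weights."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [(b / 255.0) * 2.0 - 1.0 for b in digest[:DIM]]


def block_vector(instructions: list[str]) -> list[float]:
    """Represent a basic block as the mean of its token vectors (an assumption)."""
    tokens = [t for ins in instructions for t in tokenize_instruction(ins)]
    if not tokens:
        return [0.0] * DIM
    vectors = [toy_embedding(t) for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With this sketch, `tokenize_instruction("mov eax, [rbp-0x8]")` yields the opcode, the register names, the bracket and operator symbols, and an `<imm>` token in place of `0x8`; two blocks can then be compared with `cosine_similarity(block_vector(a), block_vector(b))`.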

Keywords: binary code; similarity detection; instruction representation; tokenization method; pre-training task

Classification: TP313 [Automation and Computer Technology: Computer Software and Theory]

 
