Semantic Comparison Model for Binary Programs Based on Statistical Reasoning

Authors: GUO Xi [1]; WANG Pan [2]

Affiliations: [1] College of Informatics, Huazhong Agricultural University, Wuhan, Hubei 430070, China; [2] Hubei Key Laboratory for High-Efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology, Wuhan, Hubei 430068, China

Source: Acta Electronica Sinica (《电子学报》), 2025, No. 1, pp. 163-181 (19 pages)

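Funding: National Natural Science Foundation of China (No. 61502194); National Key Research and Development Program of China (No. 2023YFF1000100); Science and Technology Research Project of the Hubei Provincial Department of Education (No. Q20211405); Doctoral Research Start-up Fund of Hubei University of Technology (No. XJ2021003601).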

Abstract: Tasks such as program defect analysis and malicious code discovery often require analyzing the behavioral similarity of binary programs. Existing syntax-based similarity analysis methods ignore the execution semantics of the program and therefore suffer from low accuracy. Semantics-based methods, in turn, frequently invoke a constraint solver to compare semantic similarity while generating symbolic logic formulas, which incurs a large computational overhead. This paper proposes a fuzzy-matching code similarity analysis method for binary programs based on statistical reasoning. Starting from instruction-level similarity, it infers the semantic similarity of basic blocks and then of functions, level by level. First, the binary code is divided, according to a set of rules, into a collection of fragments in a canonical form, and dynamic programming is applied at basic-block granularity to build a longest-common-subsequence table of instructions with the same execution semantics, which yields an initial semantic mapping between the instructions of two basic blocks. Next, neighborhood search extends this mapping to the target function under analysis, and the execution semantics of the function are extracted along the way. Finally, the results for similar functions are analyzed statistically to compute the similarity of the binary files. An unsupervised pre-training approach is also used: tuning the parameters of the pre-trained model improves the accuracy of the similarity analysis. Experiments on 13 mainstream open-source projects, covering cross-platform and cross-optimization-option settings, show that the analysis accuracy of the proposed method is on average 7.26% higher than that of the comparison tools. Ablation experiments further show that the pre-trained model effectively improves the performance of semantic matching for binary programs.

Keywords: program analysis; semantic comparison; reverse engineering; statistical reasoning; transfer learning

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
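
The first step named in the abstract, a dynamic-programming table at basic-block granularity that yields an initial instruction mapping, can be illustrated with a plain longest-common-subsequence computation. This is only a minimal sketch, not the paper's method: the normalized instruction strings (such as `mov reg, mem`), the function names, and the Dice-style score in `block_similarity` are all assumptions made for illustration, since the abstract gives neither the normalization rules nor the scoring formula.

```python
from typing import List, Tuple


def lcs_table(a: List[str], b: List[str]) -> List[List[int]]:
    """Fill the classic LCS dynamic-programming table for two sequences."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching instructions extend the best alignment so far.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def initial_mapping(a: List[str], b: List[str]) -> List[Tuple[int, int]]:
    """Backtrack through the table to recover matched index pairs."""
    table = lcs_table(a, b)
    i, j = len(a), len(b)
    pairs: List[Tuple[int, int]] = []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def block_similarity(a: List[str], b: List[str]) -> float:
    """Hypothetical Dice-style score: 2 * |matches| / (|a| + |b|)."""
    if not a and not b:
        return 1.0
    return 2 * len(initial_mapping(a, b)) / (len(a) + len(b))


if __name__ == "__main__":
    # Hypothetical normalized basic blocks; operands are abstracted by kind.
    block_a = ["mov reg, mem", "add reg, imm", "cmp reg, imm", "jne addr"]
    block_b = ["mov reg, mem", "sub reg, imm", "add reg, imm",
               "cmp reg, imm", "jne addr"]
    print(initial_mapping(block_a, block_b))
    print(round(block_similarity(block_a, block_b), 3))
```

Exact string equality stands in here for "same execution semantics"; a semantics-aware comparison would replace the `x == y` test, and the resulting mapping would then seed the neighborhood search described in the abstract.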
