Classification of Unreproducible Build Causes Based on Log Information

Authors: MA Zhao; LIU Dong; REN Zhi-lei [1]; JIANG He [1] (School of Software Engineering, Dalian University of Technology, Dalian, Liaoning 116620, China)

Affiliation: [1] School of Software Engineering, Dalian University of Technology, Dalian, Liaoning 116620, China

Source: Computer Science (《计算机科学》), 2022, No. 12, pp. 109-117 (9 pages)

Abstract: A reproducible build is the ability to recreate binary artifacts in a predefined build environment. Because reproducible builds help secure the software build environment and improve the efficiency of building and distributing software, many open source software repositories (such as Debian) have adopted reproducible build practices. However, owing to the lack of sufficient diagnostic information and the complexity and diversity of source files, determining why a piece of software cannot be built reproducibly remains time-consuming and laborious. To address this, the paper studies machine-learning-based classification of the causes of unreproducible builds. It considers four typical causes: timestamps, file ordering, randomness, and locale. The method represents the text logs with word vectors generated by word2vec and then trains a logistic regression model on a text corpus that merges the difference log and the build log, so that the causes of unreproducible builds can be classified automatically. The algorithm is implemented and evaluated on 671 unreproducible Debian packages. Experimental results show that the method achieves a macro-averaged precision of 80.75% and a macro-averaged recall of 86.07%, outperforming other commonly used machine learning algorithms. The paper also analyzes the relevance and importance of the difference log and the build log; the results indicate that both are important for classifying the causes of unreproducible builds and that neither can be omitted. The method provides a reliable research basis for the automatic classification of unreproducible build causes.
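The abstract reports macro-averaged precision and recall over the four cause categories. As a reminder of what those figures measure, the following is a minimal sketch, not the authors' evaluation code. It assumes each package carries exactly one cause label, which the abstract does not state; the function name `macro_precision_recall`, the label spellings, and the toy labels are all hypothetical.

```python
from collections import Counter

# The four cause categories named in the abstract (spellings are illustrative).
CAUSES = ["timestamp", "file_ordering", "randomness", "locale"]


def macro_precision_recall(y_true, y_pred, labels=CAUSES):
    """Return (macro precision, macro recall) over the given labels.

    Each class is scored one-vs-rest and the per-class scores are averaged
    with equal weight, so rare causes count as much as common ones.
    A class with no predictions (or no true samples) scores 0.0 here,
    which is a convention chosen for this sketch.
    """
    true_pos = Counter()          # correct predictions per class
    predicted = Counter(y_pred)   # how often each class was predicted
    actual = Counter(y_true)      # how often each class really occurs
    for t, p in zip(y_true, y_pred):
        if t == p:
            true_pos[t] += 1
    precisions = [true_pos[c] / predicted[c] if predicted[c] else 0.0 for c in labels]
    recalls = [true_pos[c] / actual[c] if actual[c] else 0.0 for c in labels]
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Toy labels for illustration only; these are NOT data from the paper.
y_true = ["timestamp", "timestamp", "file_ordering", "randomness", "locale", "locale"]
y_pred = ["timestamp", "locale", "file_ordering", "randomness", "locale", "timestamp"]
macro_p, macro_r = macro_precision_recall(y_true, y_pred)
```

Because every class is weighted equally, a classifier that ignores an infrequent cause such as locale is penalized as heavily as one that ignores timestamps. That makes macro averaging a natural choice when the cause categories are unevenly represented.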

Keywords: reproducible build; cause classification; difference log; build log; machine learning

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]
