Bash Code Comment Generation Method Based on Dual Information Retrieval

Cited by: 4


Authors: CHEN Xiang [1,2]; YU Chi; YANG Guang; PU Xue-Lian; CUI Zhan-Qi

Affiliations: [1] School of Information Science and Technology, Nantong University, Nantong 226019, China; [2] State Key Laboratory of Information Security (Institute of Information Engineering, Chinese Academy of Sciences), Beijing 100093, China; [3] Economics and Management School, Nantong University, Nantong 226019, China; [4] School of Computer, Beijing Information Science and Technology University, Beijing 100101, China

Source: Journal of Software, 2023, Issue 3, pp. 1310-1329 (20 pages)

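Funding: National Natural Science Foundation of China (61872263, 61702041, 61202006); Open Project of the State Key Laboratory of Information Security (2020-MS-07); Jiangsu Province Frontier Leading Technology Basic Research Project (BK20202001); Jiangsu Province Key Industry Patent Navigation Project (DH20200072-10).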

Abstract: Bash is the default shell command language on Linux and plays an important role in the development and maintenance of Linux systems. For developers unfamiliar with Bash, however, understanding the purpose and functionality of Bash code is challenging. To address automatic Bash code comment generation, this paper proposes ExplainBash, a method based on dual information retrieval. The method retrieves on both semantic similarity and lexical similarity in order to generate high-quality code comments. For semantic similarity, CodeBERT and the BERT-whitening operation are used to learn a semantic representation of the code, and similarity is computed with Euclidean distance; for lexical similarity, code is represented as a set of code tokens, and similarity is computed with edit distance. A high-quality corpus is constructed by merging the corpus shared in the NL2Bash study with the data shared in the NLC2CMD competition. Nine state-of-the-art baselines are then selected from the automatic code comment generation domain, covering both information retrieval-based and deep learning-based methods. The results of an empirical study and a human study verify the effectiveness of ExplainBash. Ablation experiments are also designed to analyze the rationality of the settings in ExplainBash (such as the retrieval strategy and the BERT-whitening operation). Finally, a browser plug-in is developed based on the proposed method to help users understand Bash code.
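
The dual retrieval idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the two similarities are combined, so the two-stage scheme below (shortlist by Euclidean distance, then pick by edit distance) is an assumption, not the authors' implementation. The function names, the toy corpus, the two-dimensional vectors (hand-written stand-ins for CodeBERT plus BERT-whitening embeddings), and the whitespace tokenizer are all hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def retrieve_comment(query_code, query_vec, corpus, k=2):
    """Assumed two-stage retrieval: semantic shortlist, then lexical pick."""
    # Stage 1 (semantic): keep the k entries nearest in embedding space.
    candidates = sorted(corpus, key=lambda e: euclidean(query_vec, e["vec"]))[:k]
    # Stage 2 (lexical): choose the candidate with the smallest
    # token-level edit distance (whitespace tokenization is a simplification).
    q_tokens = query_code.split()
    best = min(candidates,
               key=lambda e: edit_distance(q_tokens, e["code"].split()))
    return best["comment"]

# Toy corpus; the vectors are illustrative placeholders, not model output.
CORPUS = [
    {"code": "find . -name '*.txt'", "vec": [0.9, 0.1],
     "comment": "Find all .txt files under the current directory"},
    {"code": "find . -type d", "vec": [0.8, 0.2],
     "comment": "List all directories under the current directory"},
    {"code": "tar -czf out.tar.gz dir", "vec": [0.1, 0.9],
     "comment": "Compress dir into a gzip tarball"},
]

if __name__ == "__main__":
    print(retrieve_comment("find . -name '*.log'", [0.85, 0.15], CORPUS))
```

In this sketch the comment of the retrieved entry is reused as the generated comment for the query. In a real setting the vectors would come from an embedding model and the corpus from the merged NL2Bash and NLC2CMD data.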

Keywords: program comprehension; Bash code; code comment generation; information retrieval; code semantics; code lexical information

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
