Contrastive learning based cross-language code clone detection


Authors: Lyu Quanrun; Xie Chunli[1]; Wan Zexuan; Wei Jiajin (School of Computer Science & Technology, Jiangsu Normal University, Xuzhou, Jiangsu 221116, China)

Affiliation: [1] School of Computer Science & Technology, Jiangsu Normal University, Xuzhou, Jiangsu 221116

Source: Application Research of Computers, 2024, No. 7, pp. 2147-2152 (6 pages)

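Funding: National Natural Science Foundation of China, General Program (62276119); Jiangsu Normal University Postgraduate Research and Practice Innovation Program (2022XKT1538).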

Abstract: Code clone detection is an important means of improving software development efficiency, quality, and reliability. Single-language clone detection based on the abstract syntax tree (AST) has achieved notable results. However, the AST nodes of cross-language code contain synonyms and near-synonyms, and manually labeling datasets is costly; both problems limit the effectiveness and practicality of existing clone detection methods. To address these issues, this paper proposes a cross-language code clone detection method based on a contrastive tree convolutional neural network (CTCNN). The method first parses code written in different programming languages into ASTs and applies synonym conversion to the AST node types and node values, reducing the differences between the ASTs of different languages. At the same time, it uses contrastive learning to augment negative samples and train the model, so that even on small-sample datasets the distance between clone pairs is minimized and the distance between non-clone pairs is maximized. Evaluated on a public dataset, the method achieves a precision of 95.26%, a recall of 99.98%, and an F1-score of 97.56%. Compared with the best existing methods, CLCDSA and C4, the model improves detection precision by 43.92% and 3.73% and the F1-score by 29.84% and 6.29%, respectively, which shows that it is an effective cross-language code clone detection method.
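The two steps named in the abstract, synonym conversion of AST node types and contrastive training, can be illustrated with a minimal sketch. It is not the paper's implementation. The synonym table and the names `SYNONYMS`, `normalize`, `cosine`, and `contrastive_loss` are hypothetical, and an InfoNCE-style loss over cosine similarity stands in for the contrastive objective, which the abstract does not specify.

```python
import math

# Hypothetical synonym table: language-specific AST node types mapped to a
# shared name. The entries are illustrative and are not taken from the paper.
SYNONYMS = {
    "MethodDeclaration": "function",  # Java-style node name
    "FunctionDef": "function",        # Python-style node name
    "ForStatement": "loop",
    "For": "loop",
}


def normalize(node_types):
    """Replace each node type with its shared synonym when one is defined."""
    return [SYNONYMS.get(t, t) for t in node_types]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the paper's objective).

    The loss is small when the anchor embedding is close to its clone
    (positive) and far from the non-clones (negatives).
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Node-type sequences from two languages become identical after conversion.
java_nodes = normalize(["MethodDeclaration", "ForStatement"])
python_nodes = normalize(["FunctionDef", "For"])

# Toy 2-D embeddings: a clone pair that is aligned versus one that is not.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy setting `java_nodes` and `python_nodes` are the same list, and `good` is smaller than `bad`, which is the behavior the abstract describes: clone pairs are pulled together and non-clone pairs are pushed apart. In the method itself the embeddings would come from the tree convolutional network rather than being written by hand.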

Keywords: cross-language; code clone; contrastive learning; abstract syntax tree

Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
