Clone Detection with Pre-training Enhanced Code Representation

Authors: LENG Lin-Shan; LIU Shuang [2]; TIAN Cheng-Lin; DOU Shu-Jie; WANG Zan [1,2]; ZHANG Mei-Shan

Affiliations: [1] School of New Media and Communication, Tianjin University, Tianjin 300072, China; [2] College of Intelligence and Computing, Tianjin University, Tianjin 300350, China

Source: Journal of Software (《软件学报》), 2022, No. 5, pp. 1758-1773 (16 pages)

Funding: National Natural Science Foundation of China (U1836214, 61802275); Independent Innovation Fund of Tianjin University (2020XRG-0022).

Abstract: Code clone detection is an important task in the software engineering community. Type-IV code clones, which share similar semantics but differ greatly in syntax, are particularly difficult to detect. Deep learning-based approaches have achieved promising performance on type-IV clone detection, yet at the high cost of using manually annotated code clone pairs for supervision. This study proposes two simple and effective pretraining strategies to enhance the code representation of deep learning-based clone detection models, aiming to reduce the need for large-scale training datasets in supervised learning. First, token embedding models are pretrained with ngram subword enrichment, which helps the clone detection model better represent out-of-vocabulary (OOV) tokens. Second, function name prediction is adopted as an auxiliary task to pretrain the clone detection model's parameters, from tokens up to code fragments. With these two pretraining strategies, a model with more accurate code representation capability is obtained; it is then used as the code representation model for clone detection and trained on the clone detection task with supervision. Experiments on the standard benchmarks BigCloneBench (BCB) and OJClone show that the pretraining-enhanced model, using only a very small number of training instances (100 clone pairs and 100 non-clone pairs on BCB; 200 clone pairs and 200 non-clone pairs on OJClone), achieves results comparable to existing methods trained on over six million instances.
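The ngram subword enrichment mentioned in the abstract can be illustrated with a fastText-style sketch. This is an assumption about the general technique, not the authors' exact model: the names `char_ngrams` and `SubwordEmbedding` are hypothetical, and the vectors below are random placeholders standing in for learned parameters.

```python
# Minimal sketch (assumption: fastText-style ngram composition) of how
# ngram subwords let an embedding model represent OOV tokens.
import numpy as np


def char_ngrams(token, n_min=3, n_max=5):
    """Extract character ngrams with boundary markers, as in fastText."""
    marked = f"<{token}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams


class SubwordEmbedding:
    """A token's vector is the mean of its ngram vectors, so even an
    unseen token gets a non-arbitrary representation."""

    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.ngram_vecs = {}  # in practice these are learned, not random

    def vec(self, token):
        grams = char_ngrams(token)
        for g in grams:
            if g not in self.ngram_vecs:
                self.ngram_vecs[g] = self.rng.normal(size=self.dim)
        return np.mean([self.ngram_vecs[g] for g in grams], axis=0)


emb = SubwordEmbedding()
# "getUserName" may be out of vocabulary, but it shares ngrams such as
# "<get" and "Name" with "getName", so their vectors are related rather
# than arbitrary.
v1, v2 = emb.vec("getUserName"), emb.vec("getName")
```

This is why subword pretraining helps clone detection on code, where identifiers are composed of recurring fragments (`get`, `Name`, `User`) and whole-token vocabularies are never complete.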

Keywords: code clone; pre-training; LSTM

Classification: TP311 (Automation and Computer Technology — Computer Software and Theory)

 
