Authors: LENG Lin-Shan; LIU Shuang [2]; TIAN Cheng-Lin; DOU Shu-Jie; WANG Zan [1,2]; ZHANG Mei-Shan (School of New Media and Communication, Tianjin University, Tianjin 300072, China; College of Intelligence and Computing, Tianjin University, Tianjin 300350, China)
Affiliations: [1] School of New Media and Communication, Tianjin University, Tianjin 300072; [2] College of Intelligence and Computing, Tianjin University, Tianjin 300350
Source: Journal of Software, 2022, No. 5, pp. 1758-1773 (16 pages)
Funding: National Natural Science Foundation of China (U1836214, 61802275); Tianjin University Independent Innovation Fund (2020XRG-0022).
Abstract: Code clone detection is an important task in the software engineering community. It is particularly difficult to detect type-IV code clones, which have similar semantics but large syntactic gaps. Deep-learning-based approaches have achieved promising performance on the detection of type-IV code clones, yet at the high cost of using manually annotated code clone pairs for supervision. This study proposes two simple and effective pretraining strategies to enhance the representation learning of deep-learning-based code clone detection models, aiming to alleviate the need for large-scale training datasets in supervised learning. First, token embedding models are pretrained with n-gram subword enrichment, which helps the clone detection model better represent out-of-vocabulary (OOV) tokens. Second, function name prediction is adopted as an auxiliary task to pretrain the clone detection model parameters, from tokens up to code fragments. With these two enhancement strategies, a model with more accurate code representation capability is obtained; it is then used as the code representation model in clone detection and trained on the clone detection task with supervised learning. Experiments are conducted on the standard benchmark datasets BigCloneBench (BCB) and OJClone. The results show that, using only a very small number of training instances (100 clone pairs and 100 non-clone pairs for BCB; 200 clone pairs and 200 non-clone pairs for OJClone), the final model achieves performance comparable to existing methods trained on over six million training instances.
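The n-gram subword enrichment described in the abstract can be sketched as follows. This is a minimal illustration in the style of FastText-like subword models, not the paper's actual implementation: the function names, the hash-bucket fallback for unseen n-grams, and the vector dimension are all hypothetical. The idea is that an OOV identifier such as a rare method name still receives a meaningful vector, because it shares character n-grams with in-vocabulary tokens.

```python
import hashlib


def char_ngrams(token, n_min=3, n_max=5):
    """Extract character n-grams of a token, with boundary markers
    (e.g. "getName" -> "<ge", "get", ..., "ame>")."""
    marked = f"<{token}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def subword_embedding(token, gram_vectors, dim=8):
    """Compose a token vector as the average of its n-gram vectors.
    n-grams missing from `gram_vectors` fall back to a deterministic
    hash-derived pseudo-vector (a hypothetical stand-in for hash
    bucketing), so even fully OOV tokens get a representation."""
    grams = char_ngrams(token)
    vec = [0.0] * dim
    for g in grams:
        gv = gram_vectors.get(g)
        if gv is None:
            h = int(hashlib.md5(g.encode()).hexdigest(), 16)
            gv = [((h >> (4 * k)) % 16) / 16.0 for k in range(dim)]
        vec = [a + b for a, b in zip(vec, gv)]
    return [v / len(grams) for v in vec]
```

Under this scheme, two identifiers like `getName` and `getFileName` overlap in several n-grams ("<ge", "get", "ame>", ...), so their composed vectors are correlated even if one of them never appeared in the pretraining vocabulary.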
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]