Research on Commit Classification Based on CodeBERT


Authors: LI Ying-ling, LAN Hong-fu, LI Ran, HUANG Min-ying [2,3]

Affiliations: [1] School of Computer Science and Engineering, Southwest Minzu University, Chengdu 610041, Sichuan, China; [2] The Key Laboratory for Computer Systems of State Ethnic Affairs Commission, Southwest Minzu University, Chengdu 610041, Sichuan, China; [3] Business School, Southwest Minzu University, Chengdu 610041, Sichuan, China

Source: Journal of Southwest Minzu University (Natural Science Edition), 2023, No. 2, pp. 189-196 (8 pages)

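Funding: Key Project of the Miaozi Program, Sichuan Provincial Department of Science and Technology (2021JDRC0066); Southwest Minzu University Research Start-up Fund (RQD2021096).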

Abstract: Understanding the maintenance activities performed in a software repository helps ensure efficient software evolution and development. Accurately classifying code commits helps software managers allocate resources more reasonably, which reduces maintenance costs. However, existing studies rely on static keywords in commit messages and neglect either the context of those keywords or the semantics of the changed code, which leads to inaccurate commit classification. This paper proposes CBEC, a commit classification model based on the pre-trained model CodeBERT. First, CBEC obtains the code diff of each commit in a public dataset, pairs each commit message with its diff, and tokenizes the pairs up to a maximum number of tokens. Next, it uses CodeBERT to learn deep semantic representations of the commit messages and diffs, while extracting hand-crafted commit features along multiple dimensions. Finally, it fuses the semantic features with the hand-crafted features and builds a CNN-based commit classifier. Compared with two representative approaches, CBEC outperforms the baselines by 5.0% to 26.8% in accuracy, 4.9% to 27.2% in precision, and 5.4% to 27.3% in recall. The work can help software practitioners better understand and identify the change intent behind code commits, which is conducive to more efficient software development.
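The data-preparation and fusion steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Commit` class, the `make_pair`, `handcrafted_features`, and `fuse` helpers, the separator string, and the four features chosen here are all hypothetical, since the abstract does not list the actual hand-crafted features. The semantic vector, which in CBEC would come from CodeBERT, is passed in as a plain list of floats so that the sketch runs with the standard library only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Commit:
    message: str  # commit message text
    diff: str     # unified diff text of the commit


def make_pair(commit: Commit, sep: str = " </s> ") -> str:
    """Join message and diff into one input string.

    The separator is a placeholder (assumption); a real tokenizer would
    insert its own special tokens when given a text pair.
    """
    return commit.message.strip() + sep + commit.diff.strip()


def handcrafted_features(commit: Commit) -> List[float]:
    """Hypothetical features: added lines, deleted lines, files touched,
    and whether the message mentions 'fix'.

    Simplification: any line starting with '+++' or '---' is treated as a
    file header, so content lines that begin that way are not counted.
    """
    added = deleted = files = 0
    for line in commit.diff.splitlines():
        if line.startswith("diff --git"):
            files += 1
        elif line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            deleted += 1
    has_fix = 1.0 if "fix" in commit.message.lower() else 0.0
    return [float(added), float(deleted), float(files), has_fix]


def fuse(semantic: List[float], handcrafted: List[float]) -> List[float]:
    """Fuse by concatenation: semantic vector first, then hand-crafted features."""
    return list(semantic) + list(handcrafted)


example = Commit(
    message="Fix null check in parser",
    diff=(
        "diff --git a/p.py b/p.py\n"
        "--- a/p.py\n"
        "+++ b/p.py\n"
        "@@ -1,2 +1,3 @@\n"
        "-x = f(y)\n"
        "+if y is not None:\n"
        "+    x = f(y)\n"
    ),
)
features = handcrafted_features(example)   # [2.0, 1.0, 1.0, 1.0]
fused = fuse([0.1, -0.3], features)        # stand-in 2-dim semantic vector
```

Concatenation is only one possible fusion strategy, and the abstract does not say which one CBEC uses. The fused vector would then be fed to a classifier, which the abstract describes as CNN-based.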

Keywords: commit classification; CodeBERT; transfer learning; convolutional neural network

Classification code: TP311.53 [Automation and Computer Technology: Computer Software and Theory]

 
