Modified Neural Language Model and Its Application in Code Suggestion


Authors: ZHANG Xian, BEN Ke-rong (School of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China)

Affiliation: [1] School of Electronic Engineering, Naval University of Engineering

Source: Computer Science (《计算机科学》), 2019, Issue 11, pp. 168-175 (8 pages)

Funding: Supported by the National Security Major Basic Research Program of China (613315)

Abstract: Language models characterize the occurrence probability of text segments. As an important class of models in natural language processing, they have in recent years been widely applied to software analysis tasks such as code suggestion. To enhance the model's ability to learn code features, this paper proposes a modified recurrent neural network language model, CodeNLM. By analyzing source code sequences represented as word embeddings, the model captures regularities in code and estimates the joint probability distribution of the sequences. Considering that existing models learn only from code data and thus under-utilize available information, an additional-information guidance strategy is proposed, which improves the characterization of code regularities with the assistance of non-code information. Targeting the characteristics of the language modeling task, a layer-by-layer incremental node-setting strategy is also proposed, which optimizes the network structure to improve the effectiveness of information transmission. In experiments on 9 Java projects totaling 2.03 million lines of code, CodeNLM achieves clearly better perplexity than n-gram models and traditional neural language models, and in the code suggestion task its mean accuracy (MRR) is 3.4% to 24.4% higher than that of the compared methods. The results show that CodeNLM effectively models programming languages, performs code suggestion well, and has a strong capability for learning long-distance information.
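The mean reciprocal rank (MRR) accuracy measure cited in the abstract can be sketched as follows. This is an illustrative implementation only; the function name and the example suggestion lists are assumptions for demonstration, not taken from the paper:

```python
# Illustrative sketch of the MRR metric used for code-suggestion evaluation.
# Example identifiers and data below are hypothetical, not from the paper.

def mean_reciprocal_rank(ranked_suggestions, targets):
    """Average of 1/rank of the correct token in each ranked suggestion
    list; a query contributes 0 when its target is absent."""
    total = 0.0
    for suggestions, target in zip(ranked_suggestions, targets):
        if target in suggestions:
            total += 1.0 / (suggestions.index(target) + 1)
    return total / len(targets)

# Three hypothetical code-suggestion queries with model-ranked candidates.
ranked = [
    ["i", "j", "index"],      # target "i" found at rank 1
    ["size", "length", "n"],  # target "length" found at rank 2
    ["foo", "bar", "baz"],    # target "qux" absent -> contributes 0
]
targets = ["i", "length", "qux"]
print(mean_reciprocal_rank(ranked, targets))  # (1 + 1/2 + 0) / 3 = 0.5
```

A higher MRR means the correct next token tends to appear nearer the top of the model's suggestion list, which is what the paper's reported 3.4% to 24.4% improvement refers to.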

Keywords: software analysis; code suggestion; natural language processing; language model; recurrent neural network

Classification: TP311.5 [Automation and Computer Technology - Computer Software and Theory]

 
