Sentence similarity computing based on word2vec and LSTM and its application in a rice FAQ question-answering system  Cited by: 19


Authors: LIANG Jingdong[1]; CUI Bingjian; JIANG Haiyan[1,2]; SHEN Yi[1]; XIE Yuancheng[1]

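Affiliations: [1] College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China; [2] National Engineering and Technology Center for Information Agriculture, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China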

Source: Journal of Nanjing Agricultural University, 2018, No. 5, pp. 946-953 (8 pages)

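Funding: National Key Research and Development Program of China (2016YFD0300607); Fundamental Research Funds for the Central Universities, Independent Innovation Key Projects (KYZ201550; KYZ201548)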

Abstract: [Objectives] A rice FAQ (frequently asked questions) question-answering system answers questions that farmers encounter during rice planting. Its core is question similarity computation, which matches a user's question against the questions in the FAQ. Because traditional sentence similarity algorithms generally have low accuracy, this study uses deep learning to compute question similarity and thereby improve the accuracy of the system's answers. [Methods] A sentence similarity model based on word2vec and an LSTM (long short-term memory) neural network was constructed, comprising an input layer, an embedding layer, an LSTM layer, a fully connected layer and an output layer. The 3 007 questions in the rice FAQ were categorized and combined into 32 072 question pairs, and the similarity of each pair was labeled to form the training and test datasets. A word2vec model trained on an agricultural-domain corpus was used to map the training data into vectors, which served as input for training the sentence similarity model. [Results] The model was validated on the test dataset and compared with three other sentence similarity methods: one based on HowNet, one based on the cosine distance of word vectors, and one based on word2vec and a CNN (convolutional neural network). A sampling check of the computed similarities indicated that the results of this model agreed better with human intuition. In terms of accuracy and ROC (receiver operating characteristic) curves, the model also clearly outperformed the other three methods, reaching an accuracy of 93.1%. [Conclusions] The model constructed in this study significantly improved the accuracy of sentence similarity computation. The rice FAQ question-answering system developed on the basis of this model can accurately match users' questions with the questions in the rice FAQ and help farmers better solve problems encountered in rice production.
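Illustrative note: one of the comparison methods named in the abstract scores a question pair by the cosine of their word vectors. The sketch below is a minimal illustration of that kind of baseline and of how accuracy on labeled pairs could be computed. It is not the authors' implementation. The toy vectors, the averaging of word vectors into a sentence vector, the 0.5 decision threshold, and all function names are assumptions made for this example; a real system would load vectors from a trained word2vec model and tokenize Chinese text.

```python
import math

# Hypothetical 3-dimensional word vectors standing in for a trained word2vec model.
TOY_VECTORS = {
    "rice": [0.9, 0.1, 0.0],
    "paddy": [0.8, 0.2, 0.1],
    "blast": [0.1, 0.9, 0.2],
    "disease": [0.2, 0.8, 0.3],
    "fertilizer": [0.1, 0.1, 0.9],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of the known tokens (assumed pooling); None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def question_similarity(q1_tokens, q2_tokens, vectors):
    """Similarity of two tokenized questions; 0.0 when a question has no known token."""
    v1 = sentence_vector(q1_tokens, vectors)
    v2 = sentence_vector(q2_tokens, vectors)
    if v1 is None or v2 is None:
        return 0.0
    return cosine_similarity(v1, v2)


def accuracy(labeled_pairs, vectors, threshold=0.5):
    """Fraction of pairs whose thresholded similarity matches the 0/1 label."""
    correct = sum(
        1
        for q1, q2, label in labeled_pairs
        if (question_similarity(q1, q2, vectors) >= threshold) == bool(label)
    )
    return correct / len(labeled_pairs)


# Two made-up labeled question pairs: (tokens of question 1, tokens of question 2, label).
PAIRS = [
    (["rice", "blast"], ["paddy", "disease"], 1),
    (["rice", "blast"], ["fertilizer"], 0),
]

if __name__ == "__main__":
    print(accuracy(PAIRS, TOY_VECTORS))
```

Averaging discards word order, which is one plausible reason a sequence model such as an LSTM could do better on question pairs that share vocabulary but differ in meaning.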

Keywords: rice; question-answering system; frequently asked questions (FAQ); word vector; long short-term memory; deep learning

Classification code: S126 [Agricultural Sciences: Basic Agricultural Sciences]

 
