Diet Health Text Classification Based on word2vec and LSTM  (Cited by: 43)


Authors: ZHAO Ming [1], DU Huifang [1], DONG Cuicui, CHEN Changsong [2]

Affiliations: [1] College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China; [2] The Third Research Institute, Ministry of Public Security, Shanghai 200031, China

Source: Transactions of the Chinese Society for Agricultural Machinery, 2017, No. 10, pp. 202-208 (7 pages)

Funding: Open Project of the Key Laboratory of Information Network Security, Ministry of Public Security (61503386)

Abstract: To classify diet-related text efficiently, a classification model based on word2vec and a long short-term memory (LSTM) network was proposed. With the rapid growth of Internet information, text is the main form of information on the network and exists in massive volume, and the same is true of text about diet. Because diet information is closely related to people's health, classifying such text automatically is important for helping people make effective use of healthy-eating information. According to the characteristics of food text in encyclopedias and diet text on health websites, word2vec was first used to produce word embeddings that carry semantic information, which addresses the sparse representation and curse of dimensionality that traditional methods face. word2vec combined with K-means++ was then used to cluster keywords for both the "suitable" and the "to be avoided" categories by semantic relation, enlarging the set of relevant words in the classification dictionaries; these words were used to derive rules that improve the quality of the training data. Document vectors built from word2vec served as the initial input to the LSTM, which was trained as a classifier that extracts features automatically and labels diet text as suitable or to be avoided. LSTM places the input-layer and hidden-layer information of the neural network into a protected memory cell. Through its "gate" structure, sigmoid and tanh functions remove information from or add information to the cell state, which gives the LSTM model a "memory" and lets it make good use of textual context, an important property for text classification.
Experiments were performed with 48 000 documents. The classification accuracy was 98.08%, higher than that obtained with tf-idf and bag-of-words text representations. Two other classification algorithms, support vector machine (SVM) and convolutional neural network (CNN), both based on word2vec, were also evaluated, and the proposed model outperformed these competing methods by several percentage points. The results indicate that the method can classify diet text automatically with high quality and help people make effective use of healthy diet information.
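
The gate mechanism mentioned in the abstract can be illustrated with a minimal sketch of one step of a standard LSTM cell. The abstract does not give the authors' exact formulation, so this follows the common textbook form; the helper names (`lstm_step`, `encode_sequence`, `random_params`), the dimensions, and the random weights are all hypothetical and stand in for trained parameters and real word2vec vectors.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Return W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a standard LSTM cell (textbook form, not the paper's code).

    params maps "f", "i", "g", "o" to a (W, b) pair applied to [x_t; h_prev].
    """
    v = list(x_t) + list(h_prev)
    f = [sigmoid(z) for z in affine(*params["f"], v)]    # forget gate: what to remove
    i = [sigmoid(z) for z in affine(*params["i"], v)]    # input gate: what to add
    g = [math.tanh(z) for z in affine(*params["g"], v)]  # candidate values
    o = [sigmoid(z) for z in affine(*params["o"], v)]    # output gate
    # New cell state: keep part of the old state, add part of the candidate.
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t

def random_params(input_dim, hidden, seed=0):
    """Hypothetical untrained weights, used only to make the sketch runnable."""
    rng = random.Random(seed)
    def layer():
        W = [[rng.gauss(0.0, 0.1) for _ in range(input_dim + hidden)]
             for _ in range(hidden)]
        return W, [0.0] * hidden
    return {name: layer() for name in "figo"}

def encode_sequence(vectors, params, hidden):
    """Run the cell over a sequence of word vectors; return the last hidden state."""
    h, c = [0.0] * hidden, [0.0] * hidden
    for x_t in vectors:
        h, c = lstm_step(x_t, h, c, params)
    return h

# Stand-in for five 8-dimensional word vectors (assumed sizes).
rng = random.Random(1)
word_vectors = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
final_h = encode_sequence(word_vectors, random_params(8, 4), 4)
```

In a classifier of the kind described, the final hidden state would typically be passed to an output layer that predicts the suitable or to-be-avoided label; that layer and the training loop are omitted here.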

Keywords: text classification; word2vec; word embedding; long short-term memory network; K-means++

Classification code: TP182 [Automation and Computer Technology: Control Theory and Control Engineering]

 
