Text Representation Model Based on Huffman-LDA and Weight-Word2vec  Cited by: 4


Authors: HUANG Chun-yu (黄春雨) [1]; HU Di (胡迪); QIU Ning-jia (邱宁佳) [1]; SUN Shuang-zi (孙爽滋) [1]

Affiliation: [1] School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022

Source: Journal of Changchun University of Science and Technology (Natural Science Edition), 2020, No. 1, pp. 89-96, 132 (9 pages)

Funding: Jilin Province Major Science and Technology Bidding Project (20170203004GX).

Abstract: LDA models the global topic-to-document structure, but its features lack the relationships among local words within a document, so it yields only sparse features. Word2vec is a word embedding model that predicts a target word from its context; however, it represents document features using local information only and lacks global information. Text representation models that combine LDA and Word2vec compute a new feature representation from topic vectors and document vectors, but they directly compute the distance between sparse topic features and word-vector-based document features, so the two kinds of features are inconsistent. This paper proposes a text representation model based on Huffman-LDA and Weight-Word2vec. First, after the topic vectors are obtained with the LDA model, a topic Huffman tree is constructed and the topic vectors are updated by gradient ascent; the new topic vectors capture the relationships among words of different topics, and the resulting features are no longer sparse. Next, the LDA topic vectors and the topic properties of words in the topic matrix are used to compute word weights that update the Word2vec word vectors, so that the word vectors carry the relationships among topic words and are then used to represent the document vector. Finally, a text representation with strong classification features is obtained from the Euclidean distance between the topic vectors and the document vector. Experimental results show that the method obtains stronger text representation features and effectively improves document classification accuracy.
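
The last two steps of the abstract (topic-weighted word vectors, then Euclidean distances to topic vectors) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the weighting formula, so the weight used here (the probability of each word under the document's topic mixture) is an assumption, as are all function names and the toy data. The sketch also assumes the topic vectors are already dense and share the dimension of the word vectors, which in the paper is the role of the Huffman-tree update.

```python
import numpy as np

def word_weights(theta, phi, word_ids):
    """Assumed weighting (not from the paper): p(word | doc) = sum_k theta[k] * phi[k, word]."""
    w = theta @ phi[:, word_ids]   # one weight per word in the document
    return w / w.sum()             # normalise so the weights sum to 1

def weighted_doc_vector(word_vecs, weights):
    """Document vector as the weighted average of its word vectors."""
    return weights @ word_vecs

def distance_features(doc_vec, topic_vecs):
    """Euclidean distance from the document vector to every topic vector."""
    return np.linalg.norm(topic_vecs - doc_vec, axis=1)

# Toy data (hypothetical): 2 topics, 3-word vocabulary, 2-dimensional embeddings.
theta = np.array([0.75, 0.25])             # document-topic distribution from LDA
phi = np.array([[0.5, 0.25, 0.25],         # topic-word matrix from LDA
                [0.0, 0.5, 0.5]])
embeddings = np.array([[1.0, 0.0],         # stand-in for Word2vec word vectors
                       [0.0, 1.0],
                       [1.0, 1.0]])
topic_vecs = np.array([[1.0, 0.0],         # stand-in for dense topic vectors
                       [0.0, 1.0]])
word_ids = np.array([0, 2])                # the document contains words 0 and 2

weights = word_weights(theta, phi, word_ids)
doc_vec = weighted_doc_vector(embeddings[word_ids], weights)
features = distance_features(doc_vec, topic_vecs)
print(features)                            # one distance per topic
```

The resulting `features` array has one entry per topic and could be passed to any standard classifier; a smaller distance means the document lies closer to that topic in the shared vector space.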

Keywords: topic model; word embedding; text representation; Huffman-LDA; Weight-Word2vec

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
