Short Text Expansion and Classification Based on Word Embedding (基于word embedding的短文本特征扩展与分类)  Cited by: 8

Authors: Meng Xin[1], Zuo Wanli[2]

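Affiliations: [1] College of Computer Science and Technology, Jilin University, Changchun 130012; [2] Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun 130012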

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2017, No. 8, pp. 1712-1717 (6 pages)

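Funding: Supported by the National Natural Science Foundation of China (60973040) and the Jilin Province Key Science and Technology Research Fund (20130206051GX)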

Abstract: In recent years, the rapid growth of short texts has challenged traditional automatic text classification techniques. To address the sparse features and low feature coverage of short texts, this paper proposes a classification method that expands short text features using word embeddings. A word embedding is a distributed representation of a word in the form of a low-dimensional continuous vector, and a well-trained word embedding model can encode many linguistic regularities and patterns. The proposed feature expansion method exploits the spatial distribution of word embeddings and the linear regularities they contain. First, a model is trained on a general corpus to obtain a word dictionary and the corresponding distributed vector representations (word embeddings). Then, according to the spatial distribution of the embeddings, the set of word embeddings mapped from the samples' word sets is clustered into different semantic units. Because word embeddings encode many linguistic regularities and patterns, additional semantic information is obtained as inference features through simple calculations between word embeddings. Finally, the word embeddings and inference features are mapped to the semantic units to form the expansion features. Using the expansion features, short text classification experiments were run on two datasets: Google search snippets and China Daily news digests. Compared with a classifier that represents text using bag of words only, accuracy improved by 8.59% and 7.42%, respectively.
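The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D embeddings, the use of k-means for forming semantic units, pairwise vector addition as the "simple calculation" that produces inference features, and all function names (`kmeans`, `infer_words`, `expansion_features`) are assumptions made for illustration only.

```python
import numpy as np

# Toy 2-D embeddings; the values are invented for illustration only.
EMBEDDINGS = {
    "goal":   np.array([0.9, 0.1]),
    "match":  np.array([0.8, 0.2]),
    "team":   np.array([1.0, 0.0]),
    "stock":  np.array([0.1, 0.9]),
    "market": np.array([0.2, 0.8]),
    "bank":   np.array([0.0, 1.0]),
}

def kmeans(points, k, iters=20):
    """Plain k-means; centroids start at evenly spaced points (deterministic)."""
    idx = np.linspace(0, len(points) - 1, k).astype(int)
    centroids = points[idx].copy()
    for _ in range(iters):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids

def nearest_unit(vec, centroids):
    """Index of the semantic unit (cluster centroid) closest to vec."""
    return int(np.linalg.norm(centroids - vec, axis=1).argmin())

def infer_words(words, embeddings):
    """Assumed inference step: add each pair of word vectors and take the
    most cosine-similar vocabulary word that is not already present."""
    inferred = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            query = embeddings[words[i]] + embeddings[words[j]]
            best, best_sim = None, -1.0
            for cand, vec in embeddings.items():
                if cand in words or cand in inferred:
                    continue
                sim = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
                if sim > best_sim:
                    best, best_sim = cand, sim
            if best is not None:
                inferred.append(best)
    return inferred

def expansion_features(text, embeddings, centroids):
    """Count how many original and inferred words fall in each semantic unit."""
    words = [w for w in text.lower().split() if w in embeddings]
    counts = np.zeros(len(centroids), dtype=int)
    for w in words + infer_words(words, embeddings):
        counts[nearest_unit(embeddings[w], centroids)] += 1
    return counts

# Cluster the vocabulary embeddings into two semantic units.
points = np.stack(list(EMBEDDINGS.values()))
centroids = kmeans(points, k=2)

print(expansion_features("team goal", EMBEDDINGS, centroids))
print(expansion_features("bank stock rally", EMBEDDINGS, centroids))
```

In a real setting the resulting per-unit counts would be concatenated with the bag-of-words vector before training a classifier, and the embeddings would come from a model trained on a large general corpus rather than a hand-written dictionary.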

Keywords: word embedding; text features; semantic inference; short text classification

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
