Short text classification based on strong feature thesaurus (cited by: 7)


Authors: Bing-kun WANG, Yong-feng HUANG, Wan-xia YANG, Xing LI

Affiliations: [1] Information Cognitive and Intelligent System Research Institute, Department of Electronic and Engineering, Tsinghua University, Beijing 100084, China; [2] Information Technology National Laboratory, Tsinghua University, Beijing 100084, China

Source: Journal of Zhejiang University-Science C (Computers and Electronics), 2012, Issue 9, pp. 649-659 (11 pages)

Funding: Project (No. 20111081023) supported by the Tsinghua University Initiative Scientific Research Program, China

Abstract: Data sparseness, the evident characteristic of short text, has always been regarded as the main cause of the low accuracy in the classification of short texts using statistical methods. Intensive research has been conducted in this area during the past decade. However, most researchers failed to notice that ignoring the semantic importance of certain feature terms might also contribute to low classification accuracy. In this paper we present a new method to tackle the problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. By giving larger weights to feature terms in the SFT, the classification accuracy can be improved. In particular, our method appeared to be more effective with more detailed classification. Experiments on two short text datasets demonstrate that our approach achieved an improvement over state-of-the-art methods, including support vector machine (SVM) and Naive Bayes Multinomial.

Keywords: Short text; Classification; Data sparseness; Semantic; Strong feature thesaurus (SFT); Latent Dirichlet allocation (LDA)

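Classification codes: TP391 [Automation and Computer Technology: Computer Application Technology]; TP391.14 [Automation and Computer Technology: Computer Science and Technology]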
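
The abstract's core idea, ranking feature terms and giving larger weights to the strongest ones, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how LDA and IG are combined or which weighting scheme is used, so the code below covers only an information-gain ranking and a flat weight multiplier, and it leaves the LDA step out entirely. The function names (`information_gain`, `build_thesaurus`, `weighted_counts`), the `boost` parameter, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a label distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Information gain of splitting the labels on 'term present / absent'."""
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    gain = entropy(Counter(labels).values())
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:
            gain -= (size / n) * entropy(part.values())
    return gain


def build_thesaurus(docs, labels, top_k):
    """Assumption: the 'strong' terms are simply the top_k terms ranked by IG."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return set(ranked[:top_k])


def weighted_counts(doc, thesaurus, boost=2.0):
    """Term counts, with thesaurus terms scaled by a hypothetical boost factor."""
    counts = Counter(doc)
    return {t: c * (boost if t in thesaurus else 1.0) for t, c in counts.items()}


# Toy data, invented for illustration only: tokenized short texts and labels.
docs = [
    ["cheap", "flights", "deal"],
    ["flights", "to", "paris"],
    ["football", "match", "tonight"],
    ["football", "score", "update"],
]
labels = ["travel", "travel", "sport", "sport"]

thesaurus = build_thesaurus(docs, labels, top_k=2)
features = weighted_counts(["cheap", "flights", "flights"], thesaurus)
```

The resulting weighted vectors could then be passed to any standard classifier, such as the SVM or multinomial Naive Bayes baselines named in the abstract. How the paper derives the thesaurus from LDA topics is not stated here, so that part is left open.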
