Short text classification method fusing BTM topic features (cited by: 11)

Improved short text classification method based on BTM topic features

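Affiliations: [1] School of Computer Science and Technology, Anhui University, Hefei 230601; [2] Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Hefei 230601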

Source: Computer Engineering and Applications (《计算机工程与应用》), 2016, No. 13, pp. 95-100 (6 pages)

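Funding: Key Project of Natural Science Research of Anhui Provincial Universities (No. KJ2013A020); Natural Science Foundation of Anhui Province (No. 11040606M133)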

Abstract: Short texts typically have little content, loose formatting, varied sentence length, and relatively complex sentence structure. They therefore yield few features, and traditional text classification algorithms perform poorly on them. This paper proposes a combined feature extraction method for short text classification that fuses BTM topic features with an improved feature weight calculation. First, starting from TF-IWF, the method reduces the weight given to term frequency and introduces word distribution entropy, which yields a new weighting algorithm. Second, it uses the topic words under each topic of the BTM topic model to supplement documents that contain few words, and takes each document's probability distribution over the topics as a second set of document features. Several groups of classification experiments with the KNN algorithm show that, compared with features computed by traditional methods such as TF-IWF, the method improves F1 by about 10%, which verifies its effectiveness.

Keywords: short text; weight calculation; TF-IWF method; topic model

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
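
The pipeline summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption: term frequency is damped as log(1 + tf), IWF is taken as log(total corpus tokens / corpus count of the word), and "word distribution entropy" is read as the normalized entropy of a word's occurrences across classes. The BTM (Biterm Topic Model) topic distribution `theta` is treated as a precomputed input, and the function names `fit_factors`, `features`, and `knn_predict` are hypothetical. This is not the authors' implementation.

```python
import math
from collections import Counter


def fit_factors(docs, labels):
    """Per-word factor: IWF * (1 - normalized class entropy). Assumed form."""
    corpus = Counter(w for d in docs for w in d)
    total = sum(corpus.values())
    n_classes = len(set(labels))
    # Count how often each word occurs in each class.
    per_class = {w: Counter() for w in corpus}
    for d, y in zip(docs, labels):
        for w in d:
            per_class[w][y] += 1
    factors = {}
    for w, n in corpus.items():
        iwf = math.log(total / n)
        h = -sum((c / n) * math.log(c / n) for c in per_class[w].values())
        # Words spread evenly over all classes carry no class information.
        spread = h / math.log(n_classes) if n_classes > 1 else 0.0
        factors[w] = iwf * (1.0 - spread)
    return factors


def features(doc, theta, factors, vocab):
    """Lexical weights (log-damped TF, an assumption) followed by topic probabilities."""
    tf = Counter(doc)
    lexical = [math.log(1 + tf[w]) * factors[w] for w in vocab]
    return lexical + list(theta)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training vectors most similar to x."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: cosine(pair[0], x), reverse=True)
    votes = Counter(y for _, y in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data; the theta values are made up and stand in for BTM output.
docs = [["goal", "match", "team"], ["team", "win", "goal"],
        ["stock", "market", "price"], ["price", "market", "fall"]]
labels = ["sport", "sport", "finance", "finance"]
thetas = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

factors = fit_factors(docs, labels)
vocab = sorted(factors)
train_X = [features(d, t, factors, vocab) for d, t in zip(docs, thetas)]
query = features(["goal", "win"], [0.8, 0.2], factors, vocab)
predicted = knn_predict(train_X, labels, query, k=3)
```

Appending the topic distribution to the lexical vector gives a very short document some non-zero features even when few of its words are in the vocabulary, which is the motivation the abstract gives for the topic features. The supplementing of short documents with BTM topic words is not shown here.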
