Short text feature extension method based on frequent term sets (cited by: 15)

Affiliations: [1] School of Computer Science, Beihang University, Beijing 100191; [2] Shenzhen Research Institute, Beihang University, Shenzhen 518000

Source: Journal of Southeast University: Natural Science Edition, 2014, No. 2, pp. 256-260 (5 pages)

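Funding: National Natural Science Foundation of China (61103095); National International Science and Technology Cooperation Program (2010DFB13350); National High Technology Research and Development Program of China (863 Program) (2011AA010502); Fundamental Research Funds for the Central Universities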

Abstract: To address the limited ability of the vector space model (VSM) to represent short text content, a feature extension method based on frequent term sets is proposed. Co-occurrence and class-orientation relations between terms are defined, and frequent term sets with the same class orientation are mined by computing the support and confidence of term sets; these sets serve as the background knowledge base for short text feature extension. For each original term in a short text, the frequent term sets containing that term are retrieved from the background knowledge base and added to the original feature vector as extended features. Experimental results on the Sogou corpus show that support and confidence have a large impact on the size of the background knowledge base, but that extending too many features introduces redundancy and yields no further improvement in classification. The background knowledge base built from frequent term sets is an effective source of extended features; when the number of training documents is limited, feature extension significantly improves the classification performance of the support vector machine (SVM).
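The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes that support is the fraction of training documents containing every term of a set, that confidence is the share of those documents belonging to the set's majority class, and that term sets are limited to pairs. The function names `mine_frequent_term_sets` and `extend_features`, the thresholds, and the `a+b` encoding of an extended feature are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def mine_frequent_term_sets(docs, labels, min_support=0.3, min_confidence=0.8):
    """Build a background knowledge base {term pair: class}.

    Assumed definitions (not taken from the paper):
      support    = docs containing both terms / all docs
      confidence = docs of the majority class containing both terms
                   / docs containing both terms
    """
    n_docs = len(docs)
    pair_labels = defaultdict(Counter)  # pair -> class label -> doc count
    for words, label in zip(docs, labels):
        # Count each pair once per document, in a canonical (sorted) order.
        for pair in combinations(sorted(set(words)), 2):
            pair_labels[pair][label] += 1

    knowledge = {}
    for pair, label_counts in pair_labels.items():
        total = sum(label_counts.values())
        majority_label, majority_count = label_counts.most_common(1)[0]
        support = total / n_docs
        confidence = majority_count / total
        if support >= min_support and confidence >= min_confidence:
            knowledge[pair] = majority_label
    return knowledge


def extend_features(words, knowledge):
    """Append every frequent term set that contains one of the original terms."""
    present = set(words)
    extra = ["+".join(pair) for pair in sorted(knowledge) if present & set(pair)]
    return list(words) + extra


# Toy training data (invented for illustration only).
docs = [
    ["stock", "market", "rise"],
    ["stock", "market", "fall"],
    ["team", "match", "win"],
    ["team", "match", "lose"],
]
labels = ["finance", "finance", "sports", "sports"]

kb = mine_frequent_term_sets(docs, labels, min_support=0.5, min_confidence=1.0)
print(kb)  # {('market', 'stock'): 'finance', ('match', 'team'): 'sports'}
print(extend_features(["stock", "price"], kb))  # ['stock', 'price', 'market+stock']
```

Raising either threshold shrinks the knowledge base, which mirrors the abstract's observation that support and confidence control its size; the extended token list would then be vectorized and passed to a classifier such as an SVM.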

Keywords: frequent itemsets; short text; classification; feature extension

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
