Text Categorization Based on Semantic Relativity and Frequent Item-set Mining (基于语义相关度和频繁项集挖掘的文本分类)

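Affiliations: [1] School of Electronic and Information Engineering, Qinzhou University, Qinzhou, Guangxi 535011, China; [2] School of Software, Zhengzhou University of Light Industry, Zhengzhou, Henan 450000, China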

Source: Journal of Qinzhou University (《钦州学院学报》), 2017, No. 5, pp. 27-33 (7 pages)

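Funding: Basic Ability Improvement Project for Young and Middle-aged Teachers in Guangxi Universities: Research on Wikipedia-based Large-scale Web Text Classification (KY2016LX431)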

Abstract: When documents are novel, sparse, and lack context information, traditional text categorization based on the bag of words (BOW) becomes ineffective. To enrich and better identify text features and thereby improve classification accuracy, a new algorithm, CBAFIS (classifier based ESA and frequent item sets), is proposed. First, ESA (Explicit Semantic Analysis), an algorithm built on Wikipedia, whose millions of articles are rich in content and quickly updated, is used to compute the semantic relatedness between the features of the training texts and Wikipedia concepts; the concepts with the highest relatedness scores are then added to the bag of words as extended features. Next, each extended document is treated as a transaction and each concept in it as an item; an FP-Tree is constructed, and the FP-Growth algorithm mines the frequent item sets of the features for each text category. Finally, a text classifier is built by combining the frequent item sets with the Naive Bayes algorithm. Experiments show that, after semantic extension and in the best case, the precision and recall of the new method exceed those of Naive Bayes and SVM by more than 2.7% and 2.6%, respectively.
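The three steps in the abstract can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation. The toy `CONCEPT_INDEX`, the `C:` prefix for concept items, the function names, and the training data are all hypothetical. Frequent item sets are found by brute-force counting (a stand-in for FP-Growth, which returns the same sets for the same minimum support but scales far better). The abstract does not say how the item sets are combined with Naive Bayes, so the sketch assumes each matched item set acts as a binary feature with a Laplace-smoothed class-conditional probability.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy ESA index: term -> {Wikipedia concept: association weight}.
# A real ESA index is derived from TF-IDF weights over Wikipedia articles.
CONCEPT_INDEX = {
    "goal": {"Football": 0.9, "Project management": 0.2},
    "striker": {"Football": 0.8},
    "match": {"Football": 0.6, "Tennis": 0.5},
    "serve": {"Tennis": 0.9},
    "racket": {"Tennis": 0.8},
}


def expand(tokens, index=CONCEPT_INDEX, top_k=1):
    """Step 1: add the top_k most related concepts to the bag of words."""
    scores = Counter()
    for term in tokens:
        for concept, weight in index.get(term, {}).items():
            scores[concept] += weight
    top = [concept for concept, _ in scores.most_common(top_k)]
    return set(tokens) | {"C:" + concept for concept in top}


def frequent_itemsets(transactions, min_support=2, max_size=2):
    """Step 2: count item sets up to max_size (brute force, not FP-Growth)."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[frozenset(combo)] += 1
    return {s: n for s, n in counts.items() if n >= min_support}


def train(docs_by_class, min_support=2):
    """Mine frequent item sets per class from the expanded documents."""
    total = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        expanded = [expand(doc) for doc in docs]
        model[label] = {
            "prior": len(docs) / total,
            "n": len(docs),
            "sets": frequent_itemsets(expanded, min_support),
        }
    return model


def classify(model, tokens):
    """Step 3 (assumed combination): Naive Bayes over matched item sets."""
    items = expand(tokens)
    vocab = set().union(*(m["sets"].keys() for m in model.values()))
    matched = [s for s in vocab if s <= items]
    best_label, best_score = None, -math.inf
    for label, m in model.items():
        score = math.log(m["prior"])
        for s in matched:
            # Laplace-smoothed estimate of P(item set present | class).
            score += math.log((m["sets"].get(s, 0) + 1) / (m["n"] + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: tokenized documents grouped by class.
TRAIN = {
    "football": [["goal", "striker"], ["goal", "match"],
                 ["striker", "match", "goal"]],
    "tennis": [["serve", "racket"], ["serve", "match"],
               ["racket", "match", "serve"]],
}
MODEL = train(TRAIN)
print(classify(MODEL, ["striker", "match"]))  # football
```

In this toy run, the ambiguous term "match" is pulled toward the football class because the expansion step adds the concept item `C:Football`, which co-occurs with "striker" and "match" in that class's frequent item sets.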

Keywords: semantic relatedness; frequent item sets; Naive Bayes; text categorization

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
