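Affiliations: [1] School of Electronic and Information Engineering, Qinzhou University, Qinzhou, Guangxi 535011; [2] School of Software, Zhengzhou University of Light Industry, Zhengzhou, Henan 450000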
Source: Journal of Qinzhou University (《钦州学院学报》), 2017, No. 5, pp. 27-33 (7 pages)
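Funding: Basic Ability Improvement Project for Young and Middle-aged Teachers in Guangxi Universities: Research on Wikipedia-based large-scale web text classification (KY2016LX431)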
Abstract: Documents that are novel, sparse, and lacking in context information are classified poorly by traditional text categorization based on BOW (Bag of Words). To enrich and better identify text features, and thereby improve classification accuracy, a new algorithm, CBAFIS (classifier based ESA and frequent item sets), is proposed. First, ESA, an algorithm designed around Wikipedia (which consists of millions of content-rich, up-to-date articles), is used to compute the semantic relatedness between features in the training texts and Wikipedia concepts, and the concepts with the highest relatedness scores are added to the bag of words as a feature extension. Next, each extended document is treated as a transaction and each concept in it as an item; an FP-Tree is constructed, and the FP-Growth algorithm mines the frequent item sets of each text category. Finally, the frequent item sets are combined with the Naive Bayes algorithm to build a text classifier. Experiments show that, after semantic extension and in the best case, the precision and recall of the new method exceed those of Naive Bayes and SVM by more than 2.7% and 2.6%, respectively.
Keywords: semantic relatedness; frequent item sets; Naive Bayes; text classification
Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
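
The three-step pipeline summarized in the abstract (concept expansion, per-category frequent item set mining, Naive Bayes classification) can be illustrated with a small Python sketch. This is not the authors' implementation, and several parts are assumptions: the `TOY_RELATEDNESS` table is a hypothetical stand-in for real ESA scores computed from Wikipedia, a brute-force enumeration replaces FP-Growth (it finds the same frequent item sets on tiny data, up to `max_size`), and the way item sets feed a Bernoulli-style Naive Bayes model is one plausible reading, since the record does not say how the two are combined.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math

# Hypothetical stand-in for ESA: term -> {Wikipedia concept: relatedness score}.
TOY_RELATEDNESS = {
    "goal": {"Football": 0.9, "Sport": 0.6},
    "striker": {"Football": 0.8},
    "cpu": {"Computer": 0.9, "Hardware": 0.7},
    "compiler": {"Computer": 0.8, "Software": 0.7},
}


def expand(tokens, top_k=1):
    """Step 1: extend the bag of words with each token's top_k related concepts."""
    items = set(tokens)
    for tok in tokens:
        scores = TOY_RELATEDNESS.get(tok, {})
        items.update(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return frozenset(items)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Step 2: brute-force substitute for FP-Growth (item sets up to max_size)."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[frozenset(combo)] += 1
    return {s for s, c in counts.items() if c >= min_support}


def train(docs, labels, min_support=2):
    """Step 3 (assumed combination): Naive Bayes over per-class frequent item sets."""
    by_class = defaultdict(list)
    for tokens, label in zip(docs, labels):
        by_class[label].append(expand(tokens))
    features = set()
    for txs in by_class.values():
        features |= frequent_itemsets(txs, min_support)
    model = {"features": features, "prior": {}, "likelihood": {}}
    for label, txs in by_class.items():
        model["prior"][label] = math.log(len(txs) / len(docs))
        # Laplace-smoothed estimate of P(item set present | class).
        model["likelihood"][label] = {
            f: (sum(f <= t for t in txs) + 1) / (len(txs) + 2) for f in features
        }
    return model


def predict(model, tokens):
    """Return the class with the highest log-posterior for the expanded document."""
    items = expand(tokens)
    best_label, best_score = None, -math.inf
    for label, prior in model["prior"].items():
        score = prior
        for f in model["features"]:
            p = model["likelihood"][label][f]
            score += math.log(p if f <= items else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    docs = [["goal", "striker"], ["goal", "match"], ["cpu", "compiler"], ["cpu", "code"]]
    labels = ["sport", "sport", "tech", "tech"]
    model = train(docs, labels)
    print(predict(model, ["striker"]))
```

In this toy data the word "striker" occurs in only one training document, so it is never frequent on its own; the test document is still assigned to "sport" because expansion adds the shared concept "Football", which is the kind of effect the abstract attributes to semantic extension.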