A Naive Bayes Text Classification Algorithm Based on Attribute-Weighted Complement Sets  (Cited by: 14)

An Improved Naive Bayesian Text Classification Algorithm based on Weighted Features and its Complementary Set

Authors: 陈凯 (CHEN Kai), 黄英来 (HUANG Ying-lai) [1], 高文韬 (GAO Wen-tao), 赵鹏 (ZHAO Peng) [1] (Information and Computer Engineering College, Northeast Forestry University, Harbin 150040, China; Harbin Metro Group Co., Ltd., Harbin 150000, China)

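Affiliations: [1] College of Information and Computer Engineering, Northeast Forestry University, Harbin, Heilongjiang 150040; [2] Harbin Metro Group Co., Ltd., Harbin, Heilongjiang 150000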

Source: Journal of Harbin University of Science and Technology (《哈尔滨理工大学学报》), 2018, No. 4, pp. 69-74 (6 pages)

Funding: New Century Excellent Talents Fund (NCET-12-0809); National Natural Science Foundation of China (31670717)

Abstract: When the classes in a text training set are unevenly distributed, the features of minority classes are swamped by those of majority classes. To address this problem, an attribute-weighted complement naive Bayes text classification algorithm, TFWCNB (TF-IDF weighted complement naive Bayes), is proposed. The algorithm uses attribute weighting to improve complement naive Bayes, with TF-IDF used to compute the weight of each feature word in the current document. It represents the features of the current class by the features of that class's complement set and combines them with the feature words' in-document weights, which counters the classifier's tendency to favor large classes and ignore small ones. In comparative experiments against traditional naive Bayes and complement naive Bayes, TFWCNB performed best when the sample set was unevenly distributed, with classification precision, recall, and G-mean reaching up to 82.92%, 84.6%, and 88.76%, respectively.
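The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact weighting formulas, so the smoothed IDF, the Laplace smoothing parameter `alpha`, the standard complement naive Bayes estimate, and all function names (`tfidf_weights`, `train_complement_nb`, `predict`, `g_mean`) are assumptions made for illustration. G-mean is taken here to be the geometric mean of per-class recall, which may differ from the definition used in the paper.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return (per-document TF-IDF dicts, idf dict) for tokenized documents.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    weighted = [{t: tf * idf[t] for t, tf in Counter(doc).items()}
                for doc in docs]
    return weighted, idf


def train_complement_nb(docs, labels, alpha=1.0):
    """Estimate, for every class, term probabilities from its COMPLEMENT."""
    weighted, idf = tfidf_weights(docs)
    vocab = sorted(idf)
    classes = sorted(set(labels))

    total = Counter()                          # term weight over all documents
    per_class = {c: Counter() for c in classes}
    for doc_weights, label in zip(weighted, labels):
        for term, w in doc_weights.items():
            total[term] += w
            per_class[label][term] += w

    log_theta = {}
    for c in classes:
        # Weight of each term in all documents that do NOT belong to c.
        comp = {t: max(0.0, total[t] - per_class[c][t]) for t in vocab}
        denom = sum(comp.values()) + alpha * len(vocab)
        log_theta[c] = {t: math.log((comp[t] + alpha) / denom) for t in vocab}
    return {"idf": idf, "log_theta": log_theta, "classes": classes}


def predict(model, doc):
    """Pick the class whose complement explains the document worst."""
    tf = Counter(t for t in doc if t in model["idf"])  # drop unseen terms
    feats = {t: n * model["idf"][t] for t, n in tf.items()}
    scores = {
        c: -sum(w * model["log_theta"][c][t] for t, w in feats.items())
        for c in model["classes"]
    }
    return max(scores, key=scores.get)


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition)."""
    counts = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / counts[c] for c in counts]
    return math.prod(recalls) ** (1.0 / len(recalls))


# Toy imbalanced corpus: four "sports" documents, one "finance" document.
docs = [
    ["ball", "game", "team"],
    ["game", "score", "team"],
    ["ball", "score", "win"],
    ["team", "win", "game"],
    ["stock", "market", "price"],
]
labels = ["sports", "sports", "sports", "sports", "finance"]
model = train_complement_nb(docs, labels)
print(predict(model, ["stock", "price"]))
print(predict(model, ["game", "team"]))
```

Because each class is scored against statistics pooled from all other classes, the minority class is estimated from the larger complement set rather than from its own few documents, which is the property the abstract relies on to reduce bias toward large classes.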

Keywords: attribute weighting; text classification; naive Bayes; imbalanced dataset

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
