Feature selection combining new document frequency with binary discernibility matrix  (Cited by: 3)


Authors: MA Chunhua (马春华) [1], ZHU Haodong (朱颢东) [2,3], ZHONG Yong (钟勇) [2,3]

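Affiliations: [1] Department of Computer Science and Technology, Suihua University, Suihua, Heilongjiang 152061; [2] Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610041; [3] Graduate University of the Chinese Academy of Sciences, Beijing 100039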

Source: Journal of Computer Applications (《计算机应用》), 2009, No. 8, pp. 2268-2271 (4 pages)

Funding: Sichuan Province Science and Technology Plan Project (2008GZ0003)

Abstract: Feature selection is a core research topic in text categorization. This paper analyzes several classic feature selection methods and summarizes their shortcomings, proposes a new document frequency measure, and introduces Rough Set (RS) theory to give an attribute reduction algorithm based on the binary discernibility matrix. Combining this attribute reduction algorithm with the new document frequency yields a comprehensive feature selection method. The method first uses the new document frequency for an initial selection that filters out some terms, and then applies the attribute reduction algorithm to eliminate redundancy. Classification experiments on 8 classes of news from People's Daily Online (http://www.people.com.cn), with 300 documents per class, show that the method achieves higher classification accuracy and recall than the Mutual Information (MI), CHI, and Information Gain (IG) methods.
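The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that do not come from the paper: the abstract does not define the new document frequency, so plain document frequency with a hypothetical `min_df` threshold stands in for it; features are treated as binary (term present or absent); and the reduction step is a common greedy covering heuristic over the binary discernibility matrix, not necessarily the authors' algorithm. All function names are illustrative.

```python
from itertools import combinations


def document_frequency_filter(docs, min_df):
    """Stage 1 (stand-in): keep terms occurring in at least min_df documents.

    Assumption: plain document frequency replaces the paper's new
    document frequency, which the abstract does not define.
    """
    df = {}
    for terms in docs:
        for term in set(terms):
            df[term] = df.get(term, 0) + 1
    return sorted(t for t, n in df.items() if n >= min_df)


def binary_discernibility_matrix(docs, labels, features):
    """One row per pair of documents with different class labels.

    An entry is 1 when the feature is present in exactly one document
    of the pair, i.e. the feature discerns the two documents.
    """
    doc_sets = [set(d) for d in docs]
    rows = []
    for i, j in combinations(range(len(docs)), 2):
        if labels[i] != labels[j]:
            row = [int((f in doc_sets[i]) != (f in doc_sets[j]))
                   for f in features]
            if any(row):  # pairs that no feature discerns cannot be covered
                rows.append(row)
    return rows


def greedy_reduct(rows, features):
    """Stage 2 (assumed heuristic): greedy cover of the matrix rows.

    Repeatedly pick the feature whose column has the most 1s among the
    rows not yet covered, then drop the rows it covers.
    """
    remaining = list(rows)
    selected = []
    while remaining:
        counts = [sum(r[k] for r in remaining) for k in range(len(features))]
        best = max(range(len(features)), key=lambda k: counts[k])
        selected.append(features[best])
        remaining = [r for r in remaining if r[best] == 0]
    return selected


def select_features(docs, labels, min_df=2):
    """Run the frequency filter, then the discernibility-based reduction."""
    candidates = document_frequency_filter(docs, min_df)
    rows = binary_discernibility_matrix(docs, labels, candidates)
    return greedy_reduct(rows, candidates)
```

Here `docs` is a list of tokenized documents (lists of terms) and `labels` is the parallel list of class labels. The first stage shrinks the vocabulary cheaply; the second keeps only enough features to distinguish every pair of differently labelled documents that the candidate features can distinguish at all.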

Keywords: feature selection; text categorization; document frequency; binary discernibility matrix; rough set; attribute reduction

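Classification: TP391.4 [Automation and Computer Technology: Computer Application Technology]; TP391.12 [Automation and Computer Technology: Computer Science and Technology]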

 
