Category Distribution-Based Feature Selection Framework

Cited by: 18

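Affiliations: [1] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; [2] Graduate University of Chinese Academy of Sciences, Beijing 100049; [3] School of Software and Microelectronics, Peking University, Beijing 102600; [4] Network Information and Educational Technology Center, Beijing Language and Culture University, Beijing 100083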

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2009, No. 9, pp. 1586-1593 (8 pages)

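Funding: National Basic Research Program of China ("973" Program) (2007CB311103); National Natural Science Foundation of China (60873166; 60603094); National High Technology Research and Development Program of China ("863" Program) (2006AA010105)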

Abstract: Text categorization is an important technique in data mining. The extremely high dimensionality of the feature space makes text categorization complex and expensive, so effective dimension reduction methods are highly desirable. Feature selection is widely used for this purpose, and many feature selection methods have been proposed in recent years, but to the best of the authors' knowledge, none of them performs well on unbalanced corpora. This paper proposes category distribution-based feature selection (CDFS), a feature selection framework based on how features are distributed across categories. The framework uses the distribution information of features to select those with strong discriminative power, and it allows weights to be assigned flexibly to categories. Assigning larger weights to rare categories improves classification performance on those categories, so the framework is suitable for unbalanced corpora and is highly extensible. In addition, OCFS and the feature filter based on category distribution difference can be viewed as special cases of the framework. Several concrete feature selection methods are obtained by implementing the framework. Experiments on two unbalanced corpora, Reuters-21578 and the Fudan University corpus, show that these methods achieve better Macro F1 and Micro F1 than IG, CHI, and OCFS.
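The abstract does not give the scoring formulas used by CDFS, so the following is only a minimal sketch of the general idea under stated assumptions: each term is scored by how far its per-category document rate deviates from the average rate across categories, and each category's contribution is multiplied by a user-chosen weight. The scoring rule, the function names `distribution_scores` and `select_top_k`, and the toy corpus are all hypothetical illustrations, not the authors' method.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence, Set


def distribution_scores(
    docs: Sequence[Set[str]],
    labels: Sequence[str],
    weights: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Score terms by how unevenly they are spread across categories.

    Assumed rule (illustrative, not taken from the paper):
        score(t) = sum_c w_c * (rate(t, c) - mean_c' rate(t, c'))**2
    where rate(t, c) is the fraction of documents in category c containing t.
    """
    n_docs = Counter(labels)            # number of documents per category
    doc_freq = defaultdict(Counter)     # doc_freq[c][t]: docs in c containing t
    for terms, label in zip(docs, labels):
        doc_freq[label].update(terms)

    categories = sorted(n_docs)
    if weights is None:
        weights = {c: 1.0 for c in categories}  # uniform weights by default

    scores: Dict[str, float] = {}
    for term in set().union(*docs):
        rates = {c: doc_freq[c][term] / n_docs[c] for c in categories}
        mean_rate = sum(rates.values()) / len(categories)
        scores[term] = sum(
            weights[c] * (rates[c] - mean_rate) ** 2 for c in categories
        )
    return scores


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy unbalanced corpus: categories A and B have 4 documents each, C has 2.
docs = [
    {"oil", "price"}, {"oil"}, {"oil", "price"}, {"oil"},          # A
    {"trade"}, {"trade", "price"}, {"trade"}, {"trade", "price"},  # B
    {"oil", "wheat"}, {"price"},                                   # C (rare)
]
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 2

uniform_top = select_top_k(distribution_scores(docs, labels), 2)
weighted_top = select_top_k(
    distribution_scores(docs, labels, {"A": 1.0, "B": 1.0, "C": 5.0}), 2
)
print(uniform_top)   # ['trade', 'oil']
print(weighted_top)  # ['trade', 'wheat']
```

In this toy corpus, raising the weight of the rare category C moves "wheat", which occurs only in C, ahead of "oil" in the ranking. This mirrors the abstract's point that larger weights on rare categories favor features useful for them, although the actual CDFS implementations may score features differently.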

Keywords: feature selection; unbalanced corpus; feature dimension reduction; text categorization; data mining

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
