检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:徐燕[1] 李锦涛[1] 王斌[1] 孙春明[1] 张森[1]
机构地区:[1]中国科学院计算技术研究所
出 处:《计算机研究与发展》2008年第4期596-602,共7页Journal of Computer Research and Development
基 金:国家自然科学基金项目(60473002,60603094);北京自然科学基金项目(4051004)
摘 要:特征选择在文本分类中起重要的作用.文档频率(DF)、信息增益(IG)和互信息(MI)等特征选择方法在文本分类中广泛应用.已有的实验结果表明,IG是最有效的特征选择算法之一,DF稍差而MI效果相对较差.在文本分类中,现有的特征选择函数性能的评估均是通过实验验证的方法,即完全是基于经验的方法,为此提出了一种定性地评估特征选择函数性能的方法,并且定义了一组与分类信息相关的基本的约束条件.分析和实验表明,IG完全满足该约束条件,DF不能完全满足,MI和该约束相冲突,即一个特征选择算法的性能在实验中的表现与它是否满足这些约束条件是紧密相关的.Text categorization (TC) is the process of grouping texts into one or more predefined categories based on their content. Due to the increased availability of documents in digital form and the rapid growth of online information, TC has become a key technique for handling and organizing text data. One of the most important issues in TC is feature selection (FS). Many FS methods have been put forward and widely used in the TC field, such as information gain (IG), document frequency thresholding (DF) and mutual information. Empirical studies show that some of these (e.g. IG, DF) produce better categorization performance than others (e.g. MI) . A basic research question is why these FS methods cause different performance. Many existing works seek to answer this question based on empirical studies. In this paper, a theoretical performance evaluation function for FS methods is put forward in text categorization, Some basic desirable constraints that any reasonable FS function should satisfy are defind and then these constraints on some popular FS methods are checked, including IG, DF and MI. It is found that IG satisfies these constraints, and that there are strong statistical correlations between DF and the constraints, whilst MI does not satisfy the constraints. Experimental results on Reuters 21578 and OHSUMED corpora show that the empirical performance of a feature selection method is tightly related to how well it satisfies these constraints.
分 类 号:TP391.3[自动化与计算机技术—计算机应用技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.117