Study and improvement of mutual information for feature selection in text categorization (文本分类中互信息特征选择方法的研究与算法改进)  Cited by: 15


Authors: XIN Zhu (辛竹)[1], ZHOU Yajian (周亚建)[1]

Affiliation: [1] Information Security Center, Beijing University of Posts and Telecommunications, Beijing 100876, China

Source: Journal of Computer Applications (《计算机应用》), 2013, Issue A02, pp. 116-118, 152 (4 pages)

Funding: National Natural Science Foundation of China (60972077)

Abstract: Mutual Information (MI) is one of the most commonly used feature selection methods in text classification. Building on an in-depth study of the traditional MI approach, this paper analyzes in detail why its classification accuracy is low. Two problems are identified: when MI is negative, the importance of the feature is weakened (the negative correlation phenomenon), and because the word frequency of a candidate feature is ignored, MI tends to select low-frequency terms. To address both problems, an optimized MI-based feature selection method is proposed. The method takes frequency, concentration, and distribution into account and introduces three adjustment parameters, which preserves the non-negligible role of negatively correlated features in text classification and increases the proportion of high-frequency terms among the selected features. Experimental results show that the improved method effectively raises text classification accuracy and is more stable than the traditional method.
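The two weaknesses named in the abstract can be seen in the standard textbook MI estimate, MI(t, c) = log(P(t, c) / (P(t) P(c))), computed from document counts. The sketch below is only an illustration of that traditional formula, not the improved method of the paper, whose adjustment parameters are not given in this record. The function name, argument names, and the toy counts are all assumptions made for the example.

```python
import math

def traditional_mi(n_tc, n_t_other, n_c_without_t, n_total):
    """Textbook MI between term t and class c from document counts.

    n_tc          -- documents in class c that contain t
    n_t_other     -- documents outside class c that contain t
    n_c_without_t -- documents in class c that do not contain t
    n_total       -- all documents in the corpus
    (All names are hypothetical; they are not taken from the paper.)
    """
    if n_tc == 0:
        return float("-inf")  # t never co-occurs with c
    docs_with_t = n_tc + n_t_other
    docs_in_c = n_tc + n_c_without_t
    # log( P(t, c) / (P(t) * P(c)) ) with probabilities estimated by counts
    return math.log((n_tc * n_total) / (docs_with_t * docs_in_c))

# Toy corpus: 1000 documents, 100 of them in class c (invented numbers).
rare = traditional_mi(1, 0, 99, 1000)        # term seen once, only in c
frequent = traditional_mi(80, 40, 20, 1000)  # term common in c, less so elsewhere
negative = traditional_mi(2, 198, 98, 1000)  # term under-represented in c

print(rare, frequent, negative)
```

With these toy counts the term that appears in a single document scores higher than the term that covers 80% of the class, which is the low-frequency bias, and the under-represented term receives a negative score even though its absence is informative about the class.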

Keywords: text categorization; mutual information; feature selection; negative correlation; frequency

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
