Optimization Mutual Information Text Feature Selection Method Based on Word Frequency

Cited by: 13

Source: Computer Engineering (《计算机工程》), 2014, No. 7, pp. 179-182 (4 pages)

Funding: National Natural Science Foundation of China (71071161; 61273209); Natural Science Foundation of Jiangsu Province (BK2012511)

Abstract: Mutual information (MI) is a commonly used text feature selection method. The classical MI method considers neither the differences in a feature term's frequency across categories nor the differences in its distribution among the documents within a single category. To address these shortcomings, this paper takes the frequency of feature terms as its basis and examines three aspects: the intra-class distribution of a term, its inter-class distribution, and its distribution across the documents within a class. By introducing an intra-class frequency factor, an intra-class position distribution factor, and an inter-class distribution factor, it proposes an improved MI text feature selection method in which term frequency information is used effectively in the MI model, remedying the weaknesses of the MI model in text feature selection. Text classification experiments show that the improved MI method raises the average accuracy rate and recall rate by about 5.2% and 4.6%, respectively, and the average F1 value by about 4.9%, effectively improving the model's text classification efficiency.
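The shortcoming named in the abstract can be illustrated with a minimal Python sketch. It computes the classical MI score from document counts only, then applies a simple frequency weight. The abstract does not give the paper's formulas, so `class_frequency_share` and the product in `weighted_mi` are hypothetical stand-ins chosen for illustration, not the authors' intra-class frequency, position distribution, or inter-class distribution factors. The toy corpus is also invented.

```python
import math

# Toy corpus (invented for illustration): each document is a list of tokens.
DOCS = [
    ["goal", "goal", "goal", "match"],           # sports
    ["goal", "goal", "match"],                   # sports
    ["goal", "match", "match", "match", "vote"], # politics
    ["vote", "vote"],                            # politics
]
LABELS = ["sports", "sports", "politics", "politics"]


def classical_mi(docs, labels, term, cls):
    """Classical MI(t, c) = log(P(t, c) / (P(t) * P(c))), estimated from
    document counts only; how often the term occurs in a document is ignored."""
    n = len(docs)
    n_tc = sum(1 for d, y in zip(docs, labels) if y == cls and term in d)
    n_t = sum(1 for d in docs if term in d)
    n_c = sum(1 for y in labels if y == cls)
    if n_tc == 0:
        return float("-inf")  # term never co-occurs with the class
    return math.log((n_tc * n) / (n_t * n_c))


def class_frequency_share(docs, labels, term, cls):
    """Hypothetical frequency factor (an assumption, not the paper's):
    the share of the term's total occurrences that fall inside the class."""
    in_cls = sum(d.count(term) for d, y in zip(docs, labels) if y == cls)
    total = sum(d.count(term) for d in docs)
    return in_cls / total if total else 0.0


def weighted_mi(docs, labels, term, cls):
    """Illustrative combination only: classical MI scaled by the frequency share."""
    return classical_mi(docs, labels, term, cls) * class_frequency_share(
        docs, labels, term, cls
    )


if __name__ == "__main__":
    for term in ("goal", "match"):
        print(
            term,
            round(classical_mi(DOCS, LABELS, term, "sports"), 4),
            round(weighted_mi(DOCS, LABELS, term, "sports"), 4),
        )
```

In this corpus "goal" and "match" appear in exactly the same documents, so classical MI gives them the same score for the sports class, even though "goal" occurs far more often in sports documents. The frequency weight separates the two terms, which is the kind of information the abstract says the improved method brings into the MI model.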

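Keywords: text classification; feature selection; mutual information; feature frequency; feature dimensionality reduction; intra-class distribution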

Classification code: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
