An Improved CHI Text Feature Selection Method Based on Location and Word Frequency Information

Original title: 基于位置及词频信息的优化CHI文本特征选择方法 (cited by: 6)


Authors: Song Aling, Liu Haifeng[1], Liu Shousheng[1]

Affiliation: [1] College of Science, PLA University of Science and Technology, Nanjing, Jiangsu

Source: Computer Science and Application (《计算机科学与应用》), 2015, No. 9, pp. 322-330 (9 pages)

Funding: National Natural Science Foundation of China (71071161, 61273209); Natural Science Foundation of Jiangsu Province (BK2012511).

Abstract: Feature selection is a core technique in automatic text classification. To address the shortcomings of the classical CHI (χ² statistic) model, this paper first prunes the feature set according to whether each feature is positively or negatively correlated with a category. It then adjusts feature weights for class-skewed (imbalanced) classification settings. Next, taking term frequency as the basis, it refines the model step by step using the position at which a feature occurs within a document and the distribution of the feature within and across classes, which yields an optimized CHI feature selection method. Text classification experiments confirm the effectiveness of the optimized CHI model.
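Note: for readers unfamiliar with the baseline, the following is a minimal Python sketch of the classical χ² (CHI) score only. It is not the optimized model proposed in the paper, whose formulas are not given in this record. The function names `chi_square` and `correlation_sign` and the contingency counts `a`, `b`, `c`, `d` are illustrative assumptions. The sketch shows why the sign of the correlation matters: the classical score squares `a*d - b*c`, so it cannot tell a term that indicates a category from one that indicates its absence.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Classical chi-square score of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def correlation_sign(a: int, b: int, c: int, d: int) -> int:
    """Return +1 for positive, -1 for negative, 0 for no correlation."""
    diff = a * d - b * c
    return (diff > 0) - (diff < 0)


# A term concentrated inside the category and a term concentrated outside it
# receive the same classical score; only the sign tells them apart.
positive = (40, 10, 10, 40)
negative = (10, 40, 40, 10)
print(chi_square(*positive), correlation_sign(*positive))
print(chi_square(*negative), correlation_sign(*negative))
```

Both calls give a score of 36.0, with signs +1 and -1 respectively. A selection step that uses the sign can therefore treat the two terms differently even though the classical score ranks them equally.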

Keywords: feature selection; χ² statistic; correlation; position distribution; class skew

Classification code: TP39 [Automation and Computer Technology: Computer Application Technology]

 
