An Improved CHI Text Feature Selection Algorithm  Cited by: 5

Authors: FAN Cunjia (樊存佳), WANG Yousheng (汪友生)[1], WANG Yuting (王雨婷)[1]

Affiliation: [1] College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing 100124, China

Source: Computer and Modernization (《计算机与现代化》), 2016, No. 11, pp. 7-11, 63 (6 pages)

Abstract: Feature selection is a critical step in text classification. The CHI statistic is a classical feature selection method, but it has known shortcomings, and this paper addresses them in two ways. First, so that both the document frequency and the term frequency of a feature are taken into account, a term frequency factor and inter-class variance are introduced into CHI. Second, to exclude features that rarely appear in the specified class but are common in other classes, and to reduce the error caused by manually choosing a scaling factor, an adaptive scaling factor is introduced into CHI. Experimental results show that, compared with the CHI statistic, the improved CHI feature selection method achieves higher classification accuracy on unbalanced corpora.
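
For orientation, the baseline that the abstract refers to can be sketched as follows. This is the classical textbook chi-square score for a term and a class, not the improved method of the paper; the abstract does not give the formulas for the term frequency factor, the inter-class variance, or the adaptive scaling factor, so none of them is reproduced here. The function name and the argument names a, b, c, d are illustrative assumptions.

```python
def chi_square(a, b, c, d):
    """Classical CHI score of a term for one class (baseline only).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate contingency table: the score is undefined, so return 0.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# A term concentrated in the class and a term concentrated outside it
# receive the same score, because the difference (a*d - c*b) is squared.
positive = chi_square(40, 10, 10, 40)
negative = chi_square(10, 40, 40, 10)
```

Because the score depends only on document counts and squares the correlation term, it ignores how often a term occurs within a document and cannot tell apart terms that are typical of a class from terms that are typical of its complement. These appear to be the two shortcomings the abstract sets out to address.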

Keywords: CHI statistic; term frequency factor; inter-class variance; adaptive scaling factor

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
