Text feature selection based on improved CHI and PCA (基于改进CHI和PCA的文本特征选择)

Cited by: 5

Authors: WEN Wu (文武) [1,2,3]; WAN Yu-hui (万玉辉); ZHANG Xu-hong (张许红); WEN Zhi-yun (文志云)

Affiliations: [1] School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; [2] Research Center of New Telecommunication Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; [3] Chongqing Information Technology Designing Co. Ltd., Chongqing 401121, China

Source: Computer Engineering & Science (《计算机工程与科学》), 2021, No. 9, pp. 1645-1652 (8 pages)

Abstract: Text data contain a large amount of noise and many redundant features. To obtain a more representative feature set, a feature selection algorithm (ICHIPCA) is proposed that combines an improved chi-square statistic (ICHI) with principal component analysis (PCA). First, because the CHI algorithm ignores term frequency, document length, category distribution, and negative correlation, corresponding adjustment factors are introduced to refine the CHI calculation model. Second, the improved CHI model is used to score the features, and the top-ranked features are taken as the preliminary feature set. Finally, PCA extracts the principal components while largely preserving the original information, which reduces the dimensionality. Experiments with a KNN classifier show that, compared with traditional feature selection algorithms of the same type such as IG and CHI, ICHIPCA improves classification performance across multiple feature dimensions and multiple categories.
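The scoring-and-ranking step described in the abstract can be illustrated with the classic (unimproved) CHI statistic. The sketch below is not the paper's ICHI model: the adjustment factors for term frequency, document length, category distribution, and negative correlation are not given in this record, so none are reproduced, and the PCA step is omitted. The function names `chi_square` and `select_top_k`, the max-over-categories aggregation, and the toy corpus are all assumptions made for illustration.

```python
def chi_square(docs, labels, term, category):
    """Classic CHI score of `term` for `category` from a 2x2 contingency table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # The squared (a*d - b*c) term discards the sign of the correlation.
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_top_k(docs, labels, k):
    """Rank terms by their maximum CHI score over categories; keep the top k."""
    vocab = sorted({t for tokens in docs for t in tokens})
    cats = sorted(set(labels))
    scores = {t: max(chi_square(docs, labels, t, c) for c in cats) for t in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: each document is a set of tokens.
docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"stock", "team"}]
labels = ["sport", "sport", "finance", "finance"]
top = select_top_k(docs, labels, 2)
print(top)
```

In this toy corpus, "team" appears equally often in both categories and scores 0, while "goal" and "stock" each occur in only one category and rank highest. Because the classic formula squares the difference, it cannot tell positive from negative correlation, which is one of the shortcomings the abstract says the improved model addresses.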

Keywords: text classification; PCA; CHI; dimensionality reduction; feature selection

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
