Text Clustering Algorithm Based on Feature Matrix Optimization and Data Dimensionality Reduction  Cited by: 19

Authors: CHEN Wei[1]; LU Jiawei

Affiliation: [1] School of Optical Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

Source: Journal of Data Acquisition and Processing, 2021, No. 3, pp. 587-594 (8 pages)

Abstract: To address the poor clustering quality in text clustering caused by the curse of dimensionality and the loss of feature information, this paper proposes a clustering algorithm based on feature matrix optimization and improved principal component analysis (PCA) dimensionality reduction. Building on the original term frequency-inverse document frequency (TF-IDF) algorithm, an adaptive length frequency weight (ALFW) optimization scheme is proposed, which gives the feature matrix a better distribution and makes the feature terms more distinct. For dimensionality reduction, the PCA algorithm is optimized using the joint entropy criterion from information theory, and the resulting UE-PCA (United entropy-PCA) algorithm further reduces the dimensionality of sparse high-dimensional data while better preserving the authenticity of the original high-dimensional data. Simulation experiments show that the proposed algorithm (K-means+UE-PCA+ALFW) outperforms other algorithms of the same type.

Keywords: text clustering; feature matrix; joint entropy; TF-IDF algorithm; PCA

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
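
The abstract names the pipeline stages but does not give the ALFW weighting formula or the joint entropy criterion behind UE-PCA, so neither is reproduced here. As a rough point of reference, the following is a minimal sketch of the unmodified baseline that the paper sets out to improve: plain TF-IDF weighting, standard PCA, and K-means. The function names (`tfidf_matrix`, `pca_reduce`, `kmeans`), the whitespace tokenizer, the `+1` in the IDF term, the farthest-point initialisation, and the toy documents are all assumptions of this sketch, not details taken from the paper.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Return (matrix, vocabulary) of plain TF-IDF weights with L2-normalised rows."""
    tokenised = [doc.lower().split() for doc in docs]  # assumed whitespace tokenizer
    vocab = sorted({tok for doc in tokenised for tok in doc})
    index = {tok: j for j, tok in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in tokenised for tok in set(doc))
    X = np.zeros((n_docs, len(vocab)))
    for i, doc in enumerate(tokenised):
        for tok, count in Counter(doc).items():
            tf = count / len(doc)
            # The +1 keeps terms that occur in every document from vanishing.
            idf = math.log(n_docs / df[tok]) + 1.0
            X[i, index[tok]] = tf * idf
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms), vocab


def pca_reduce(X, n_components):
    """Project X onto its leading principal components (standard PCA via SVD)."""
    centred = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T


def kmeans(X, k, n_iter=50):
    """Lloyd's K-means with deterministic farthest-point initialisation."""
    centres = [X[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen centre.
        nearest = np.min([np.linalg.norm(X - c, axis=1) for c in centres], axis=0)
        centres.append(X[nearest.argmax()])
    centres = np.array(centres)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre; keep the old one if a cluster is empty.
        new_centres = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centres[c]
            for c in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels


# Toy corpus with two obvious topics (illustrative data only).
docs = [
    "stock market shares rise trading",
    "market trading stock investors shares",
    "football match team goal score",
    "team score goal football league",
]
X, vocab = tfidf_matrix(docs)
Z = pca_reduce(X, n_components=2)
labels = kmeans(Z, k=2)
print(labels)
```

In this layout, a weighting scheme such as ALFW would replace the weight computed inside `tfidf_matrix`, and a component-selection rule such as the one in UE-PCA would replace the fixed `n_components` cut in `pca_reduce`; how the paper actually does either is not stated in the abstract.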

 
