Visualized determination mode for clustering quantity of high-dimensional data


Authors: HE Xuansen; HE Fan; FAN Yueping; CHEN Hongjun

Affiliations: [1] School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China [2] College of Information Science and Engineering, Hunan University, Changsha 410082, China [3] School of Management and Economics, Beijing Institute of Technology, Beijing 100081, China

Source: Journal of Shenyang Aerospace University, 2024, No. 3, pp. 71-84 (14 pages)

Funding: Special Project in Key Fields for Regular Universities of Guangdong Province (Project No. 2021ZDZX1035).

Abstract: The classical K-means clustering algorithm requires users to know the number of clusters in advance, and its results are sensitive to initialization. To address both problems, a comprehensive scheme is proposed that improves the random initial partitioning of K-means and determines the number of clusters visually. First, the data are standardized so that they follow a normal distribution, and principal component analysis (PCA) extracts the most important features to reduce the dimensionality of the high-dimensional data. Then, farthest centroid selection and the min-max distance rule replace the random initialization of K-means, avoiding empty clusters and ensuring data separability. On this basis, a statistical empirical rule estimates the possible range of the number of clusters, and the optimal number is estimated by locating the elbow of the sum-of-squared-error (SSE) curve within this range. Finally, the silhouette coefficients of the individual clusters are computed and compared to evaluate clustering quality, which ultimately determines the inherent number of clusters in the data set. Simulation results show that the scheme not only determines the potential number of clusters visually but also provides an effective method for high-dimensional data analysis in the era of big data.
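The steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the farthest-first seeding stands in for "farthest centroid selection and the min-max distance rule", the upper bound `k_max = sqrt(n)` stands in for the "statistical empirical rule", and the elbow is taken as the largest second difference of the SSE curve. The PCA step is omitted because the toy data are already two-dimensional. All function names (`standardize`, `farthest_first_init`, `kmeans`, `mean_silhouette`, `make_blobs`) are hypothetical.

```python
import math
import random

def standardize(rows):
    """Z-score every column (zero mean, unit variance)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_init(points, k):
    """Assumed seeding: start from the point nearest the data mean, then
    repeatedly add the point whose nearest chosen centroid is farthest."""
    mean = [sum(c) / len(points) for c in zip(*points)]
    centroids = [min(points, key=lambda p: dist2(p, mean))]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    return [list(c) for c in centroids]

def kmeans(points, k, iters=100):
    """Lloyd iterations from the deterministic seeding above."""
    centroids = farthest_first_init(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    sse = sum(dist2(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, sse

def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(
                    math.sqrt(dist2(p, q)))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or a single cluster
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        m = max(a, b)
        scores.append((b - a) / m if m else 0.0)
    return sum(scores) / len(scores)

def make_blobs(seed=0):
    """Toy data: three well-separated 2-D Gaussian blobs, 20 points each."""
    rng = random.Random(seed)
    centers = [(0, 0), (10, 0), (5, 9)]
    return [[cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)]
            for cx, cy in centers for _ in range(20)]

data = standardize(make_blobs())
k_max = int(math.sqrt(len(data)))  # assumed rule of thumb for the search range
sse = {k: kmeans(data, k)[2] for k in range(1, k_max + 1)}
# Assumed elbow criterion: largest second difference of the SSE curve.
elbow = max(range(2, k_max),
            key=lambda k: (sse[k - 1] - sse[k]) - (sse[k] - sse[k + 1]))
sil = {k: mean_silhouette(data, kmeans(data, k)[0])
       for k in range(2, k_max + 1)}
best_by_silhouette = max(sil, key=sil.get)
print(elbow, best_by_silhouette)
```

On this toy data the elbow of the SSE curve and the maximum mean silhouette can be compared directly. Plotting `sse` and `sil` against `k` would give the kind of visual check the abstract describes.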

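Keywords: K-means clustering algorithm; principal component analysis; farthest centroid selection; min-max distance rule; statistical empirical rule; elbow method; silhouette analysis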

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
