Optimization of K-Means Fast Clustering Algorithm Based on Spark

Cited by: 17

Authors: WANG Quan-min; HU De-cheng (Department of Information, Beijing University of Technology, Beijing 100022, China)

Affiliation: [1] Department of Information, Beijing University of Technology, Beijing 100022, China

Source: Computer Simulation (《计算机仿真》), 2022, No. 3, pp. 344-349 (6 pages)

Funding: Beijing Natural Science Foundation (4202004).

Abstract: To address the shortcomings of clustering algorithms when processing massive data, an optimized K-means fast clustering algorithm based on Spark is proposed. Morphological similarity distance replaces Euclidean distance as the similarity measure to improve clustering accuracy. The maximum distance (Max-distance) criterion is used to mitigate the local optimum problem caused by poorly chosen initial cluster centers. To reduce redundant computation during iteration, a grid structure is built from the positions of the data points relative to the cluster centroids. The value of K is determined with the elbow method by plotting the sum of squared errors (SSE) against K, and the resulting SMGK-means (SparkMaxGridK-means) clustering algorithm is implemented on Spark. Experiments show that SMGK-means improves accuracy by 6.73% on average and also delivers excellent execution efficiency and parallel computing capability on a Spark distributed cluster.
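Two of the steps named in the abstract, maximum-distance seeding and the SSE-versus-K elbow curve, can be illustrated with a minimal single-machine sketch. It is not the authors' SMGK-means implementation: it does not use Spark, it omits the grid structure, and it uses Euclidean distance as a stand-in because the abstract does not define the morphological similarity distance. All function names (`max_distance_init`, `kmeans`, `sse`), the choice of the first point as the initial seed, and the toy data are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Stand-in distance (assumption); the paper uses a morphological similarity distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_distance_init(points, k, dist=euclidean):
    """Farthest-first seeding: each new center is the point whose distance
    to its nearest already-chosen center is largest."""
    centers = [points[0]]  # assumption: the first point is the first seed
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


def kmeans(points, k, dist=euclidean, iters=20):
    """Plain Lloyd iterations starting from the max-distance seeds."""
    centers = max_distance_init(points, k, dist)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old center if a cluster is empty.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments are stable, so stop early
            break
        centers = new_centers
    return centers


def sse(points, centers, dist=euclidean):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(dist(p, c) for c in centers) ** 2 for p in points)


# Toy data (hypothetical): three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

seeds = max_distance_init(points, 3)
# SSE for K = 1..4; the elbow is where the drop in SSE flattens out.
sse_by_k = {k: sse(points, kmeans(points, k)) for k in range(1, 5)}
```

On this toy data the SSE falls sharply up to K = 3 and changes little afterwards, which is the elbow the method looks for. Because `dist` is a parameter, another similarity measure could be passed in without changing the rest of the sketch.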

Keywords: morphological similarity distance; maximum distance; positional relationship

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
