Improved K-means fast clustering algorithm based on Spark

Cited by: 16

Authors: XU Jianrui (徐健锐) [1,2]; ZHAN Yongzhao (詹永照) [1]

Affiliations: [1] School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China; [2] Zhenjiang Branch, Jiangsu Union Technical Institute, Zhenjiang, Jiangsu 212016, China

Source: Journal of Jiangsu University (Natural Science Edition), 2018, No. 3, pp. 316-323 (8 pages)

Funding: National Natural Science Foundation of China (61672268)

Abstract: As the volume of data handled by clustering algorithms keeps growing and timeliness requirements tighten in big-data settings, this paper proposes Spark-KM, an improved fast K-means clustering algorithm built on the distributed computing framework Spark. First, because poorly chosen initial cluster centers can trap K-means in a local optimum and increase the number of iterations, making it ill-suited to large-scale clustering, the algorithm is improved by combining pre-sampling with the max-min distance method. Second, the original data are partitioned into matrix blocks that are stored on different nodes of the Spark cluster. Finally, the improved K-means algorithm is combined with distributed matrix computation on the Spark platform to cluster large data sets quickly. The results show that the algorithm effectively reduces the number of data movements between nodes and scales well. Comparative tests in stand-alone and cluster environments indicate that the algorithm is suited to large-scale data, that its performance is proportional to the data size, and that the cluster environment performs considerably better than the stand-alone environment.
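The abstract does not spell out how pre-sampling and max-min distance are combined, so the following is only a minimal single-machine sketch of one common reading: draw a random sample, then repeatedly pick the sampled point that lies farthest from its nearest already-chosen center. The function name `max_min_init`, its parameters, and the choice of the first center are assumptions made for illustration; this is not the paper's Spark-KM implementation and involves no Spark code.

```python
import math
import random


def max_min_init(points, k, sample_size, seed=0):
    """Pick k initial centers: pre-sample, then apply the max-min distance rule.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rng = random.Random(seed)
    # Pre-sampling: work on a random subset rather than the full data set.
    sample = rng.sample(points, min(sample_size, len(points)))
    # Assumption: the first center is simply the first sampled point.
    centers = [sample[0]]
    # nearest[i] holds the distance from sample[i] to its closest chosen center.
    nearest = [math.dist(p, centers[0]) for p in sample]
    while len(centers) < k:
        # Max-min rule: choose the point whose nearest center is farthest away.
        idx = max(range(len(sample)), key=nearest.__getitem__)
        centers.append(sample[idx])
        # Update each point's nearest-center distance with the new center.
        nearest = [min(d, math.dist(p, sample[idx])) for d, p in zip(nearest, sample)]
    return centers
```

On well-separated data this tends to spread the initial centers across different clusters, which is the property the abstract credits with avoiding poor local optima and reducing the number of iterations.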

Keywords: improved K-means; pre-sampling; max-min distance; matrix partitioning; matrix computation

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
