Optimization of K-means Clustering Algorithm Based on MapReduce (cited by: 5)

Source: Computer Measurement & Control (《计算机测量与控制》), 2016, No. 7, pp. 272-275, 279 (5 pages)

Funding: National Natural Science Foundation of China (11271057; 51176016); Natural Science Foundation of Jiangsu Province (BK2009535)

Abstract: The traditional K-means clustering algorithm depends heavily on the choice of initial centers, tends to converge to a local rather than a global optimum, and struggles to meet the demands of processing massive data sets. To address these shortcomings, an improved K-means clustering algorithm based on MapReduce is proposed. The algorithm uses systematic sampling to obtain a representative sample set that stands in for the massive data set; applies the density method and the max-min distance method to obtain optimized initial cluster centers; uses the Canopy algorithm to produce a rough clustering that reduces the computational scale; and finally applies sequential composition of MapReduce jobs to parallelize the algorithm, so that it can make full use of the computing and storage capacity of a cluster and thus suit massive-data scenarios. The improved algorithm is compared with traditional clustering algorithms, and the results show that it outperforms them. It reduces the dependence on the initial cluster centers, improves clustering accuracy, reduces the number of iterations and the clustering time, and shows a considerable performance advantage when processing massive data.
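The pipeline described in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's implementation, since the record gives no algorithmic details. It only shows systematic sampling, max-min distance seeding, and one K-means round split into map and reduce functions. The density method and the Canopy pre-clustering step are omitted, and every function name, the choice of the first sampled point as the initial seed, and the parameters `step`, `max_iter`, and `tol` are assumptions made for illustration.

```python
import math
from collections import defaultdict


def systematic_sample(points, step, start=0):
    """Systematic sampling: take every `step`-th point beginning at `start`."""
    return points[start::step]


def max_min_centers(points, k):
    """Max-min distance seeding: each new center is the point whose distance
    to its nearest already-chosen center is largest."""
    centers = [points[0]]  # assumption: the first sampled point is the seed
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


def kmeans_map(point, centers):
    """Map phase: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, point


def kmeans_reduce(idx, assigned):
    """Reduce phase: the new center is the mean of the points assigned to idx."""
    dim = len(assigned[0])
    mean = tuple(sum(p[d] for p in assigned) / len(assigned) for d in range(dim))
    return idx, mean


def kmeans_iteration(points, centers):
    """One map -> shuffle -> reduce round, simulated in a single process."""
    groups = defaultdict(list)  # stands in for the shuffle stage
    for p in points:
        idx, value = kmeans_map(p, centers)
        groups[idx].append(value)
    new_centers = list(centers)  # centers with no assigned points stay put
    for idx, assigned in groups.items():
        new_centers[idx] = kmeans_reduce(idx, assigned)[1]
    return new_centers


def cluster(points, k, step=2, max_iter=100, tol=1e-9):
    """Seed on a systematic sample, then iterate until the centers stop moving."""
    sample = systematic_sample(points, step)
    centers = max_min_centers(sample, k)
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift < tol:
            break
    return centers
```

In a real Hadoop job the loop in `cluster` would correspond to chaining one MapReduce job per iteration, which is one plausible reading of the "sequential composition" mentioned in the abstract; the record itself does not say how the jobs are chained.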

Keywords: K-means algorithm; sampling; Canopy algorithm; max-min distance method

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
