Affiliation: [1] School of Information Science and Engineering, Changzhou University, Changzhou, Jiangsu 213164, China
Source: Computer Measurement & Control (《计算机测量与控制》), 2016, Issue 7, pp. 272-275, 279 (5 pages)
Funding: National Natural Science Foundation of China (11271057; 51176016); Natural Science Foundation of Jiangsu Province (BK2009535)
Abstract: The traditional K-means clustering algorithm depends heavily on the choice of initial centers, tends to converge to a local rather than a global optimum, and struggles to process massive data sets. To address these shortcomings, an improved K-means clustering algorithm based on MapReduce is proposed. The algorithm first uses systematic sampling to obtain a representative sample set that stands in for the full data set. It then applies a density method and the Max-Min distance method to obtain optimized initial cluster centers, and uses the Canopy algorithm to produce a coarse clustering that reduces the scale of computation. Finally, it chains MapReduce jobs sequentially to parallelize the algorithm, so that it can make full use of the cluster's computing and storage capacity and scale to massive data. The improved algorithm is compared with traditional clustering algorithms, and the results show that it performs better. The experiments indicate that it reduces the dependence on the initial cluster centers, improves clustering accuracy, lowers both the number of iterations and the clustering time, and shows a clear performance advantage when processing massive data.
Keywords: K-means algorithm; sampling; Canopy algorithm; Max-Min distance method
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
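
The abstract names systematic sampling and Max-Min distance seeding as the first two stages. As a rough single-machine illustration of those two ideas only, a minimal Python sketch might look like the following. The function names `systematic_sample` and `max_min_centers` are hypothetical, the first seed is simply taken by index rather than chosen by the paper's density method, and the Canopy and MapReduce stages are not reproduced, so this should not be read as the authors' implementation.

```python
import math


def systematic_sample(data, step, start=0):
    """Return every `step`-th record, beginning at index `start`."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return data[start::step]


def max_min_centers(points, k, first=0):
    """Pick k seed points with the Max-Min distance rule.

    Assumption: the first seed is points[first]; the paper selects its
    starting point with a density criterion that is not shown here.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [points[first]]
    # nearest[i] is the distance from points[i] to its closest seed so far.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # The next seed is the point whose closest seed is farthest away.
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centers


if __name__ == "__main__":
    records = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 9)]
    print(systematic_sample(list(range(10)), 3))
    print(max_min_centers(records, 3))
```

Seeds chosen this way are spread far apart, which is the property the abstract relies on to reduce sensitivity to the initial centers; the seeds would then be handed to an ordinary K-means loop.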