Authors: XU Jianrui [1,2], ZHAN Yongzhao [1]
Affiliations: [1] School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China; [2] Zhenjiang Branch, Jiangsu Union Technical Institute, Zhenjiang, Jiangsu 212016, China
Source: Journal of Jiangsu University: Natural Science Edition, 2018, No. 3, pp. 316-323 (8 pages)
Funding: National Natural Science Foundation of China (61672268)
Abstract: In big-data settings, clustering algorithms must process ever larger data sets under ever tighter time constraints. To address this, Spark-KM, an improved fast K-means clustering algorithm built on the distributed computing framework Spark, is proposed. First, K-means is improved by combining pre-sampling with the maximum-minimum distance method, because poorly chosen initial cluster centers lead to local optima and extra iterations, which makes standard K-means unsuitable for large-scale clustering. Second, the original data are partitioned into matrix blocks that are stored on different nodes of the Spark framework. Finally, the improved K-means algorithm is combined with distributed matrix computation on the Spark platform to cluster large data sets quickly. The results show that the algorithm effectively reduces the number of data movements between nodes and scales well. Comparative tests in stand-alone and cluster environments indicate that the algorithm is suitable for large-scale data, that its performance is proportional to the data size, and that the cluster environment performs considerably better than the stand-alone environment.
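Keywords: improved K-means; pre-sampling; maximum-minimum distance; matrix partitioning; matrix computation
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]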
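
The initialization idea named in the abstract (pre-sampling combined with the maximum-minimum distance rule) can be illustrated with a minimal single-machine sketch. This is not the authors' Spark-KM implementation: the function name `max_min_init`, its parameters, the uniform random pre-sample, and the choice of the first center are all assumptions made for illustration, and the Spark matrix partitioning step is not shown.

```python
import math
import random


def max_min_init(points, k, sample_size, seed=0):
    """Pick k initial centers from a random pre-sample (hypothetical helper).

    Assumption: centers are chosen greedily so that each new center is the
    sample point farthest from all centers chosen so far (max-min distance).
    """
    rng = random.Random(seed)
    # Pre-sampling: select centers on a random subset to keep the cost low.
    sample = rng.sample(points, min(sample_size, len(points)))
    if k > len(sample):
        raise ValueError("k must not exceed the pre-sample size")

    # First center: an arbitrary sample point (an assumption of this sketch).
    centers = [sample[0]]
    # nearest[i] holds the distance from sample[i] to its closest center.
    nearest = [math.dist(p, centers[0]) for p in sample]

    while len(centers) < k:
        # Max-min rule: take the point whose nearest-center distance is largest.
        idx = max(range(len(sample)), key=nearest.__getitem__)
        new_center = sample[idx]
        centers.append(new_center)
        # Update each point's distance to its closest center.
        nearest = [min(d, math.dist(p, new_center)) for d, p in zip(nearest, sample)]
    return centers
```

The returned centers would then seed an ordinary K-means loop. Spreading the seeds apart in this way is one common way to reduce the chance that two initial centers fall in the same cluster, which is the problem the abstract attributes to poor initialization.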