Authors: Huang Xueyu (黄学雨)[1], Xiang Chi (向驰), Tao Tao (陶涛)
Affiliation: [1] School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China
Source: Application Research of Computers (《计算机应用研究》), 2021, No. 10, pp. 2988-2993, 3024 (7 pages)
Funding: National Key Research and Development Program of China (2020YFB1713700).
Abstract: Partition-based clustering algorithms select their initial cluster centers at random, which makes them sensitive to initialization, produces unstable clustering results, and lowers efficiency on a cluster. To address these problems, this paper proposes a partition clustering algorithm based on the MapReduce framework and an improved density peak, named MR-IDPACA. First, a new local density measure is defined using natural nearest neighbors, and the density peak points found among the samples are used as the initial cluster centers of the partition clustering algorithm. Second, to address the algorithm's long running time on large-scale data, a partitioning method based on E2LSH (exact Euclidean locality sensitive hashing) is proposed, named KLSH (K of locality sensitive hashing). After the data is partitioned with this method, the initial cluster centers are searched for in parallel under the MapReduce framework, which effectively reduces the time spent on that search. Third, to handle data skew in the MapReduce framework, an ME (multistage equilibrium) strategy partitions the intermediate data into multiple balanced segments to improve running efficiency. Finally, clustering is carried out in parallel under the MapReduce framework to obtain the final clustering result. Experiments show that MR-IDPACA achieves relatively high accuracy and strong stability in a single-machine environment, attains a good speedup ratio and running time on a cluster, and improves clustering quality.
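The abstract gives no formulas, so the following is only a rough single-machine sketch of the two ideas it names: hashing points into partitions with the standard E2LSH hash, and picking initial centers at density peaks. Every function name and parameter here (`e2lsh_bucket`, `partition_by_lsh`, `density_peak_centers`, `w`, `k_neighbors`) is hypothetical. The local density is assumed to be the inverse mean distance to the k nearest neighbors, a stand-in for the paper's natural-nearest-neighbor definition, which is not given here. KLSH itself, the ME strategy, and the MapReduce stages are not reproduced.

```python
import math
import random


def e2lsh_bucket(point, a, b, w):
    """Standard E2LSH hash value: floor((a . x + b) / w)."""
    projection = sum(ai * xi for ai, xi in zip(a, point))
    return math.floor((projection + b) / w)


def partition_by_lsh(points, w=4.0, seed=0):
    """Group point indices by bucket id, using a single hash function.

    ASSUMPTION: one Gaussian projection; the paper's KLSH scheme may differ.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # 2-stable (Gaussian) vector
    b = rng.uniform(0.0, w)                        # random offset in [0, w]
    buckets = {}
    for i, p in enumerate(points):
        buckets.setdefault(e2lsh_bucket(p, a, b, w), []).append(i)
    return buckets


def density_peak_centers(points, k_centers, k_neighbors=3):
    """Return indices of the k_centers points with the largest rho * delta."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # ASSUMPTION: density = inverse mean distance to the k nearest neighbours.
    rho = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k_neighbors]
        rho.append(1.0 / (sum(nearest) / len(nearest) + 1e-12))

    # delta = distance to the nearest point of strictly higher density;
    # a point with no denser neighbour gets its largest distance instead.
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))

    ranked = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    return ranked[:k_centers]
```

In a parallel setting, one plausible arrangement would be to run `density_peak_centers` separately on each bucket returned by `partition_by_lsh` and merge the candidates afterwards, but how the paper combines per-partition results is not stated in the abstract.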
Keywords: partition clustering algorithm; density peak; natural nearest neighbor; MapReduce; data skew
Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]