基于MapReduce和改进密度峰值的划分聚类算法  被引量:7

Partition clustering algorithm based on MapReduce and improved density peak

在线阅读下载全文

作  者:黄学雨[1] 向驰 陶涛 Huang Xueyu;Xiang Chi;Tao Tao(School of Information Engineering,Jiangxi University of Science&Technology,Ganzhou Jiangxi 341000,China)

机构地区:[1]江西理工大学信息工程学院,江西赣州341000

出  处:《计算机应用研究》2021年第10期2988-2993,3024,共7页Application Research of Computers

基  金:国家重点研发计划项目(2020YFB1713700)。

摘  要:对于基于划分的聚类算法随机选取初始聚类中心导致初始中心敏感,聚类结果不稳定、集群效率低等问题,提出一种基于MapReduce框架和改进的密度峰值的划分聚类算法(based on MapReduce framework and improved density peak partition clustering algorithm,MR-IDPACA)。首先,通过自然最近邻定义新的局部密度计算方式,将搜索样本密度峰值点作为划分聚类算法的初始聚类中心;其次针对算法在大规模数据下运行时间复杂,提出基于E2LSH(exact Euclidean locality sensitive hashing)的一种分区方法,即KLSH(K of locality sensitive hashing)。通过该方法对数据分区后结合MapReduce框架并行搜寻初始聚类中心,有效减少了算法在搜索初始聚类中心时的运行时间;对于MapReduce框架中的数据倾斜问题,提出ME(multistage equilibrium)策略对中间数据进行多段均衡分区,以提升算法运行效率;在MapReduce框架下并行聚类,得到最终聚类结果。实验得出MR-IDPACA算法在单机环境下有着较高的准确率和较强的稳定性,集群性能上也有着较好的加速比和运行时间,聚类效果有所提升。Aiming at clustering algorithm based on partition to randomly select the initial cluster center,which leads to the sensitivity of the initial center,unstable clustering result,low cluster efficiency,etc.,this paper proposed a partition clustering algorithm based on MapReduce framework and improved density peak,named MR-IDPACA.Firstly,this paper defined a new local density calculation method by natural nearest neighbors,and then searched for the peak point of the sample density as the initial cluster center of the partitioning clustering algorithm.Secondly,in viewed of the complex running time of the algorithm under large-scale data,it proposed an algorithm based on E2LSH,named KLSH.In this method,the data was partitioned and combined with the MapReduce framework to search the initial cluster centers in parallel,which effectively reduced the running time of the algorithm when searching for the initial cluster centers.Next,for the data skew problem in the MapReduce framework,it proposed the ME strategy to divide the intermediate data into multi-segment equilibrium to improve the efficiency of the algorithm.Finally,parallel clustering under the MapReduce framework to obtain the final clustering result.The experiment shows that the MR-IDPACA algorithm has higher accuracy and stronger stability in a single-machine environment,and the cluster performance also has a better speedup ratio and running time,and the clustering effect has been improved.

关 键 词:划分聚类算法 密度峰值 自然最近邻 MAPREDUCE 数据倾斜 

分 类 号:TP311[自动化与计算机技术—计算机软件与理论]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象