Clustering Algorithm Based on Fast Search and Density Peaks Lookup on Hadoop Platform  (Cited by: 2)

Authors: GUO Youxiong, HUANG Tianqiang[1,2], LIN Lingpeng, HUANG Wei[1,2] (Fujian Normal University, Fuzhou, Fujian 350007, China)

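Affiliations: [1] Faculty of Software, Fujian Normal University, Fuzhou, Fujian 350007 [2] Fujian Provincial Engineering Technology Research Center for Big Data Mining and Applications, Fujian Normal University, Fuzhou, Fujian 350007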

Source: Journal of Fuqing Branch of Fujian Normal University, 2018, No. 2, pp. 37-44, 109 (9 pages)

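Funding: National Natural Science Foundation of China (61070062; 61502103); Fujian Provincial Major Science and Technology Project for University-Industry Cooperation (2015H6007); Fuzhou Science and Technology Plan Project (2014-G-76); Program for New Century Excellent Talents in Universities of Fujian Province (JAI1038); Class K Fund Project of the Fujian Provincial Department of Science and Technology (2011007); Class A Fund Project of the Fujian Provincial Department of Education (JA10064)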

Abstract: The parallel K-means algorithm requires the starting center points to be initialized manually, and in every iteration it recomputes the distance between every point and every center, which is inefficient. To address these problems, a parallel algorithm based on fast search and density peak lookup is proposed. The algorithm is parallelized according to an "integration as one" principle: each cluster in the local CFSFDP clustering result produced in the Map phase on each node is treated as a single sample point to be clustered, and in the Reduce phase these sample points are clustered once more with CFSFDP, so that similar clusters are quickly gathered together and merged into the same category. Experiments were carried out with parallel programming on the Hadoop platform, clustering a massive collection of news data. The results show that, compared with the traditional parallel K-means algorithm, the algorithm with the embedded fast search and density peak lookup clustering improves markedly in both efficiency and clustering accuracy.
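The two-stage idea in the abstract can be illustrated with a minimal single-process sketch in Python. This is not the authors' Hadoop implementation, and several details are assumptions the abstract does not state: each local cluster is represented by its centroid, density is a simple cutoff count within a hypothetical radius `dc`, and the `k` points with the largest product of density and separation distance are taken as centers. The function names (`density_peaks`, `map_phase`, `reduce_phase`, `cluster_two_stage`) are illustrative only.

```python
import math


def density_peaks(points, dc, k):
    """Density-peak clustering sketch; returns one label (0..k-1) per point."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # Local density (assumed cutoff kernel): number of other points closer than dc.
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < dc) for i in range(n)]
    # Decreasing density; ties are broken by index so the order is deterministic.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    rank = {i: pos for pos, i in enumerate(order)}
    delta = [0.0] * n   # distance to the nearest point that ranks higher
    parent = [-1] * n   # index of that nearest higher-ranked point
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(d[i])  # convention for the densest point
        else:
            j = min(order[:pos], key=lambda h: d[i][h])
            delta[i], parent[i] = d[i][j], j
    # Assumed selection rule: the k points with the largest rho * delta are centers.
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], rank[i]))[:k]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Every other point inherits the label of its nearest higher-ranked point.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


def centroid(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(p[t] for p in pts) / len(pts) for t in range(len(pts[0])))


def map_phase(split, dc, k):
    """Cluster one data split locally; emit (centroid, members) per local cluster."""
    groups = {}
    for p, lab in zip(split, density_peaks(split, dc, k)):
        groups.setdefault(lab, []).append(p)
    return [(centroid(g), g) for g in groups.values()]


def reduce_phase(local_clusters, dc, k):
    """Treat each local cluster as one sample point and cluster those points again."""
    reps = [c for c, _ in local_clusters]
    merged = {}
    for (_, members), lab in zip(local_clusters, density_peaks(reps, dc, k)):
        merged.setdefault(lab, []).extend(members)
    return list(merged.values())


def cluster_two_stage(splits, dc, k):
    """Run the map step on every split, then a single reduce step."""
    local = [lc for s in splits for lc in map_phase(s, dc, k)]
    return reduce_phase(local, dc, k)


if __name__ == "__main__":
    split_a = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    split_b = [(0.5, 0.5), (1, 1), (0, 0.5), (10.5, 10.5), (11, 11), (10, 10.5)]
    clusters = cluster_two_stage([split_a, split_b], dc=2.0, k=2)
    print([len(c) for c in clusters])
```

In this toy run each split is clustered into two local groups, and the reduce step merges the four resulting centroids into two global clusters. In a real MapReduce job the splits would be processed on separate nodes and only the per-cluster summaries would be shuffled to the reducer.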

Keywords: Hadoop; fast search and density peak lookup; clustering; MapReduce

Classification No.: TP311.13 [Automation and Computer Technology: Computer Software and Theory]

 
