Research on Parallelization of Hierarchical Clustering Algorithm Based on Spark  (Cited by: 6)


Authors: YU Sheng-hui; LI Ling-juan [1] (School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China)

Affiliation: [1] School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China

Source: Computer Technology and Development, 2020, No. 6, pp. 19-22 (4 pages)

Funding: National Key R&D Program of China (2017YFB1401302, 2017YFB0202200); National Natural Science Foundation of China (61572260, 61872196).

Abstract: With the arrival of the big data era, traditional computing models can no longer cope with data at this scale. Spark, a parallel computing framework for big data built on in-memory computation, addresses this problem well. CURE is a hierarchical clustering algorithm based on sampling and representative points; it works iteratively, merging the two closest clusters from the bottom up. Compared with traditional clustering algorithms, CURE is less sensitive to outliers. On large datasets, however, its repeated iterations consume a great deal of time. This paper uses the scalability and distributed nature of Spark's RDD programming model to parallelize the computation in CURE, which speeds up data processing, lets the algorithm adapt as the data scale grows, and improves clustering performance. Results from running the parallelized CURE algorithm on public datasets with Spark show that the Spark-based parallelization preserves clustering accuracy while improving the algorithm's time efficiency.
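The merge loop summarized in the abstract can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' implementation: the function names, the farthest-point heuristic for picking representatives, and the parameter defaults (`num_reps=4`, `shrink=0.3`) are assumptions made for illustration, and the sampling step is omitted. The abstract does not say how the paper partitions the work across RDDs; the comment in the loop only marks the step that dominates the running time.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def representatives(points, num_reps=4, shrink=0.3):
    """Pick well-scattered points, then shrink them toward the centroid.

    Shrinking pulls representatives away from the cluster boundary,
    which is what dampens the influence of outliers.
    """
    center = centroid(points)
    distinct = list(dict.fromkeys(points))  # drop duplicates, keep order
    # Assumed heuristic: start from the point farthest from the centroid,
    # then repeatedly add the point farthest from the chosen ones.
    reps = [max(distinct, key=lambda p: math.dist(p, center))]
    while len(reps) < min(num_reps, len(distinct)):
        candidates = [p for p in distinct if p not in reps]
        reps.append(max(candidates,
                        key=lambda p: min(math.dist(p, r) for r in reps)))
    return [tuple(r_i + shrink * (c_i - r_i) for r_i, c_i in zip(r, center))
            for r in reps]


def cluster_distance(reps_a, reps_b):
    """Distance between clusters = closest pair of representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)


def cure(points, k, num_reps=4, shrink=0.3):
    """Bottom-up merging until k clusters remain (sampling omitted)."""
    clusters = [[tuple(p)] for p in points]
    reps = [representatives(c, num_reps, shrink) for c in clusters]
    while len(clusters) > k:
        # This all-pairs search is repeated every iteration and is the
        # costly part on large inputs.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(reps[ij[0]], reps[ij[1]]))
        merged = clusters[i] + clusters[j]
        del clusters[j], reps[j]  # j > i, so index i stays valid
        clusters[i] = merged
        reps[i] = representatives(merged, num_reps, shrink)
    return clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for cluster in cure(data, k=2):
        print(sorted(cluster))
```

Each iteration of this sketch rescans every pair of clusters, which makes the time cost of repeated iteration on large inputs easy to see.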

Keywords: Spark; hierarchical clustering; CURE; RDD; parallelization

Classification No.: TP301.6 [Automation and Computer Technology: Computer System Architecture]

 
