SCoS:基于Spark的并行谱聚类算法设计与实现  被引量:13

SCoS:The Design and Implementation of Parallel Spectral Clustering Algorithm Based on Spark

在线阅读下载全文

作  者:朱光辉[1,2] 黄圣彬 袁春风 黄宜华[1,2] ZHU Guang-Hui ;HUANG Sheng-Bin ;YUAN Chun-Feng ;HUANG Yi-Hua(State Key Laboratory for Novel Software Technology,Nanjing University,Nanjing 210046;Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210046)

机构地区:[1]南京大学计算机软件新技术国家重点实验室,南京210046 [2]江苏省软件新技术与产业化协同创新中心,南京210046

出  处:《计算机学报》2018年第4期868-885,共18页Chinese Journal of Computers

基  金:企业合作研究项目"大规模软件分析与系统研究";国家自然科学基金项目(61572250);江苏省科技支撑计划项目(BE2017155)资助

摘  要:谱聚类是一种比传统聚类算法更为高效的算法,其建立在谱图理论基础上,并将聚类问题转化为图的最优划分问题.与传统k-means算法不同的是,谱聚类算法不仅能够在任意形状的样本空间上实现聚类,而且可以收敛至全局最优解.然而,谱聚类算法的计算开销较大,不仅需要计算任意两个样本之间的相似性,而且还需要计算Laplacian矩阵的特征向量.因此,在大规模数据场景下,谱聚类算法存在计算耗时过长甚至无法完成计算的问题.为了解决谱聚类算法在大规模数据场景下的计算性能问题,使得谱聚类算法能够应用在大数据集上,文中基于Apache Spark分布式并行计算框架研究并实现了大规模并行谱聚类算法SCoS,对算法流程中的每个计算步骤进行了并行化.具体的,SCoS主要实现了相似度矩阵构建与稀疏化过程的并行化、Laplacian矩阵构建与正规化过程的并行化、正规化Laplacian矩阵特征向量计算的并行化以及k-means聚类的并行化.为了降低谱聚类算法中大规模样本相似性计算的开销,SCoS采用了基于多轮迭代的并行计算方式实现大规模样本之间的相似性计算.针对大规模谱聚类算法中耗时较长的Laplacian矩阵特征向量求解问题,SCoS基于ScaLAPACK实现了特征向量的并行化求解,同时文中也实现了近似特征向量计算算法,并且对比分析了精确特征向量计算与近似特征向量计算对于谱聚类算法的性能影响.为了进一步提升大规模谱聚类算法的性能,SCoS采取了矩阵稀疏化表示与存储、Laplacian矩阵乘法优化以及k-means聚类中距离计算放缩剪枝等多种优化手段,尽可能地减少计算开销、存储空间开销以及数据传输开销.实验表明,SCoS不仅在聚类效果上要优于传统的聚类算法,而且具有较高的运行效率,特别是在大规模数据集下,仍具有较高的计算性能,并表现出了良好的数据可扩展性和系统可扩展�Spectral clustering is a more efficient algorithm than traditional clustering algorithms.In spectral clustering algorithm,the clustering problem is transformed into optimal graph partition problem according to the spectral graph theory.Compared with k-means algorithm,the difference is that spectral clustering algorithm is not only able to achieve clustering on the sample space of arbitrary shape,but also able to converge to the global optimal solution.However,the large computation overhead is a main problem of spectral clustering algorithm,which requires not only the calculation of pairwise similarities between all samples but also the calculation of eigenvectors of Laplacian matrix.Therefore,in some big data scenarios,the computation time of spectral clustering algorithm is too long to tolerate and the spectral clustering algorithm may not work efficiently.In order to improve the performance of spectral clustering algorithm in big data,a large-scale parallel spectral clustering algorithm named SCoS is studied and realized in this paper with Apache Spark distributed and parallel computing framework.In SCoS,each computational step in spectral clustering algorithm procedure is implemented in parallel.Specifically,the main steps of spectral clustering algorithm,including similarity matrix construction and sparsification,Laplacian matrix construction and normalization,normalized Laplacian matrix eigenvector computation and k-means clustering are all parallelized in SCoS.In order to reduce the computation overhead of large scale similarity measurement in spectral clustering algorithm,SCoS adopts a novel parallel similarity computation approach based on multi-round iteration technique,which guarantees that the similarity of any two samples is calculated only once and the duplicated computation can thus be avoided.Aiming at getting over the time-consuming Laplacian matrix eigenvector computation problem in large-scale spectral clustering algorithm,an efficiently parallel eigenvector computation method is proposed in SCo

关 键 词:谱聚类 并行化 相似性度量 分布式计算 APACHE SPARK 

分 类 号:TP311[自动化与计算机技术—计算机软件与理论]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象