基于Spark的并行社区发现算法  被引量:1

Parallelized community detection algorithm based on Spark

在线阅读下载全文

作  者:刘东江 黎建辉[1] Liu Dongjiang;Li Jianhui(Computer Network Information Center,Chinese Academy of Sciences,Beijing 100190,China;University of Chinese Academy of Sciences,Beijing 100190,China)

机构地区:[1]中国科学院计算机网络信息中心,北京100190 [2]中国科学院大学,北京100190

出  处:《计算机应用研究》2020年第8期2255-2260,共6页Application Research of Computers

基  金:国家重点研发计划资助项目(2016YFB1000600);中国科学院战略性先导科技专项资肋项目(XDA06010307)。

摘  要:针对大规模图数据顶点聚类进行研究,提出了一种基于Spark的并行社区发现算法,其在基于极值优化的串行社区发现算法的基础上设计而成。此外还针对该串行算法在簇调整时因选择顶点数量过少而影响算法运行效率的问题,提出了一种多个顶点选择方法。该方法会计算一个阈值并发现所有适应度值小于该阈值的顶点,作为被选择的顶点;由于阈值是基于所有顶点的适应度值计算出来的,为了避免非常大的适应度值对阈值造成的影响该方法会限制被选择顶点的数量,若被选择的顶点过多,算法只保留其中的一部分。同时,还提出了一种顶点过滤方法,其可以有效减少图数据的数据量。实验表明,提出算法的运行时间明显短于比较的其他基于Spark的并行化社区发现算法,可以发现其运行速度相对较快。This paper focused on graph clustering and proposed a parallelized community detection algorithm based on Spark.This algorithm was based on sequence community detection algorithm using extremal optimization.The sequence algorithm tried to choose one vertex each time when adjust the clusters.So it would take long time to adjust the clusters.For this reason,the proposed algorithm adopted a new multi-vertices selection method.This method tried to calculate a threshold value and found all the vertices whose fitness value was smaller than the threshold value.Then the proposed algorithm also needed to change the category of these vertices;besides,as the proposed method toke the fitness values of all the vertices into consideration when tried to calculate the threshold,it needed to select limited number of vertices in order to avoid the influence of extremely large fitness values.So if the proposed method selected great amount of vertices,it would only keep part of them.At the same time,the proposed algorithm also adopted a new vertices filtering method.This method could reduce the volume of graph data efficiently.The results show that proposed algorithm takes shorter time than other parallelized algorithms for comparison.It means that the proposed algorithm runs relatively faster.

关 键 词:社区发现 SPARK 并行算法 图聚类 图数据 

分 类 号:TP391[自动化与计算机技术—计算机应用技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象