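Affiliations: [1] School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018; [2] School of Economics and Management, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018; [3] Research Center for Big Data and Social Computing, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018; [4] Shijiazhuang Campus, Army Engineering University of PLA, Shijiazhuang, Hebei 050005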
Source: Journal of Hebei University of Science and Technology, 2022, No. 2, pp. 194-203 (10 pages)
Funding: National Culture and Tourism Science and Technology Innovation Project (2020); Hebei Provincial Science and Technology Program (20310802D, 21310101D).
Abstract: Overlapping community discovery is fundamental to complex network mining. It applies to the analysis of social networks, communication networks, protein interaction networks, metabolic pathway networks, transportation networks, and other networks, and thereby serves fields such as intelligent transportation, infectious disease prevention and control, public opinion analysis, new drug development, and human resource management. Traditional single-machine computing architectures can no longer meet the analysis and computing requirements of large-scale complex networks. Researchers in artificial intelligence have proposed applying community discovery to network representation learning in order to enrich the features of nodes and edges. However, traditional overlapping community discovery algorithms were not designed with the requirements of network representation learning in mind: they focus only on assigning nodes to communities and give little consideration to the internal structure and external boundaries of communities. For example, they do not address the weight of a node within a community or the ranking of a node's degree of membership across the multiple communities it belongs to. As a result, they cannot provide richer feature information about nodes and communities, and they support network representation learning tasks poorly. To address both problems, namely that traditional single-machine overlapping community discovery algorithms are unsuitable for mining large-scale complex networks and cannot meet the requirements of network representation learning, a distributed overlapping community discovery algorithm based on the community forest model (distributed community forest model, DCFM) is proposed. First, the network dataset is stored in a distributed file system and divided into blocks, and a distributed computing framework executes the CFM algorithm on each block; then, communities are merged; finally, the community division results are aggregated. The algorithm was run on a Spark cluster using the real DBLP dataset and evaluated by F-mean and running time. The results show that the F-mean of DCFM is slightly lower than that of CFM, but its running time decreases almost linearly as the number of nodes increases; at the cost of a small loss in F-mean, DCFM gains the ability to process large-scale network data. The number of partitions strongly affects computing time: on the com-dblp.ungraph.txt dataset, CFM needs 192 min, whereas DCFM needs about 91 min when the data are split into 6 parts and only about 13 min when split into 100 parts. Therefore, the DCFM algorithm, which computes backbone degree in a distributed manner on a big data platform and then divides and merges communities, is a …
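The three stages named in the abstract (split the data into blocks, detect communities per block, merge and aggregate) can be illustrated with a minimal sequential sketch. Everything here is an assumption made for illustration: the abstract does not specify the CFM procedure or the merge rule, so connected components stand in for the per-block CFM step, a Jaccard-similarity threshold stands in for community merging, and round-robin assignment stands in for distributed file system blocks. The function names are hypothetical, and plain Python replaces the Spark cluster.

```python
from collections import defaultdict
from itertools import combinations


def split_edges(edges, num_blocks):
    """Assign edges to blocks round-robin (stand-in for file system blocks)."""
    blocks = [[] for _ in range(num_blocks)]
    for i, edge in enumerate(edges):
        blocks[i % num_blocks].append(edge)
    return blocks


def local_communities(block):
    """Placeholder for the per-block CFM step: connected components
    computed with union-find. This is NOT the CFM algorithm itself."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in block:
        parent[find(u)] = find(v)

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())


def merge_communities(communities, threshold=0.5):
    """Assumed merge rule: repeatedly union any two communities whose
    Jaccard similarity is at least `threshold`."""
    merged = [set(c) for c in communities]
    changed = True
    while changed:
        changed = False
        for a, b in combinations(range(len(merged)), 2):
            inter = len(merged[a] & merged[b])
            union = len(merged[a] | merged[b])
            if union and inter / union >= threshold:
                merged[a] |= merged[b]
                del merged[b]
                changed = True
                break  # indices shifted, so restart the scan
    return merged


def dcfm_sketch(edges, num_blocks, threshold=0.5):
    """Split, detect per block, then merge and return all communities."""
    blocks = split_edges(edges, num_blocks)
    found = [c for block in blocks for c in local_communities(block)]
    return merge_communities(found, threshold)


if __name__ == "__main__":
    # Two triangles; each is split across the two blocks and then re-merged.
    edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
    print(dcfm_sketch(edges, num_blocks=2))
```

Communities whose similarity stays below the threshold are kept separate, so a node can remain in more than one community, which is what makes the result overlapping. On a cluster, the per-block step would run in parallel, which is where the reported reduction from 192 min to about 13 min (roughly 14.8 times faster) would come from.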
Keywords: distributed processing system; social network; overlapping community; community forest model; community discovery
Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]