Parallel FP_growth Association Rules Mining Method on Spark Platform

Cited by: 5


Authors: ZHU An-qing (朱岸青) [1], LI Shuai (李帅) [2], TANG Xiao-dong (唐晓东) [3]

Affiliations: [1] School of Management, Jinan University, Guangzhou 510000, China; [2] School of Computer Science and Engineering, Beihang University, Beijing 100191, China; [3] School of Economics and Management, South China Normal University, Guangzhou 510006, China

Source: Computer Science (《计算机科学》), 2020, No. 12, pp. 139-143 (5 pages)

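Funding: Guangzhou Patent Technology Industrialization Project (201601010207); National Natural Science Foundation of China, General Program (61672077); National Key R&D Program of China (2017YFF0106407); National Natural Science Foundation of China, Young Scientists Fund, 2017 (61702026).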

Abstract: To improve the efficiency of association rule mining, this paper proposes a parallel FP_growth association rule mining method for the Spark platform. First, Spark performs the traversal scans in the in-memory RDDs of all nodes of the distributed system to obtain the frequent sets, which are then used to generate the FP_Table and update the FP_Tree. Next, a time series is introduced to predict the itemsets to be mined, so that all nodes in the distributed system share the mining tasks evenly and the FP_Tree traversal capacity of each node is fully used to obtain the FP_growth association rule mining results. The experimental results show that, compared with the single-machine case, parallelized FP_growth association rule mining improves efficiency by about 60%. After load balancing, the mining efficiency is higher still, improving by about 14%, which indicates that the traversal tasks are distributed more evenly across the nodes and the degree of parallelism is higher.
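The two stages described in the abstract can be illustrated with a minimal, Spark-free sketch in plain Python. It is not the authors' implementation: the function names (`count_items`, `frequent_items`, `balance`) and the sample data are hypothetical, lists of transactions stand in for RDD partitions, and a simple greedy heuristic that uses item support as a cost estimate stands in for the paper's time-series prediction, whose details the abstract does not give.

```python
from collections import Counter
import heapq

def count_items(partitions):
    """Count item support per partition (map), then merge the counts (reduce)."""
    total = Counter()
    for part in partitions:
        # set(txn) ensures an item is counted once per transaction
        local = Counter(item for txn in part for item in set(txn))
        total.update(local)
    return total

def frequent_items(partitions, min_support):
    """Return (item, support) pairs meeting min_support, most frequent first."""
    counts = count_items(partitions)
    freq = {item: c for item, c in counts.items() if c >= min_support}
    return sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))

def balance(costs, n_workers):
    """Greedy stand-in for load balancing: take the heaviest task first
    and give it to the currently least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for item, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, w = heapq.heappop(heap)
        assignment[w].append(item)
        heapq.heappush(heap, (load + cost, w))
    return assignment

# Hypothetical data: two "partitions" of transactions
partitions = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "c"], ["b", "d"], ["a"]],
]
table = frequent_items(partitions, min_support=2)
# Assumption: an item's support approximates the cost of mining it
plan = balance(dict(table), n_workers=2)
```

Here `table` plays the role of the frequent-item table built from the scan, and `plan` maps each worker to the items whose conditional patterns it would mine. A real deployment would replace the loop over `partitions` with RDD operations and the cost estimate with the predicted workload.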

Keywords: Spark platform; FP_growth algorithm; association rule mining; frequent sets; load balancing

Classification number: TP311.13 [Automation and Computer Technology: Computer Software and Theory]

 
