基于Spark的并行频繁项集挖掘算法  被引量:6

A PARALLEL FREQUENT ITEMSETS MINING ALGORITHM BASED ON SPARK

在线阅读下载全文

作  者:张素琪[1] 孙云飞 武君艳 顾军华 Zhang Suqi;Sun Yunfei;Wu Junyan;Gu Junhua(School of Information Engineering, Tianjin University of Commerce, Tianjin 300134, China;School of Artificial Intelligence and Data Science Institute, Hebei University of Technology, Tianjin 300401, China;Hebei Key Laboratory of Big Data Computing, Tianjin 300401, China)

机构地区:[1]天津商业大学信息工程学院,天津300134 [2]河北工业大学人工智能与数据科学学院,天津300401 [3]河北省大数据计算重点实验室,天津300401

出  处:《计算机应用与软件》2019年第2期24-28,143,共6页Computer Applications and Software

基  金:河北省科技计划项目(17210305D);天津市科技计划项目(15ZXHLGX00130;16ZXHLSF0023)

摘  要:关联规则挖掘是数据挖掘领域的重要研究方向之一。频繁项集的挖掘是关联规则挖掘的第一步,也是最重要的步骤。FP-Growth(Frequent Pattern-Growth)算法因其挖掘效率以及空间复杂度方面的优势被广泛应用于频繁项集挖掘任务中。面对海量数据,FP-Growth算法挖掘效率变得极低甚至失效。在Hadoop大数据平台上实现的基于MapReduce框架的并行FP-Growth算法——PFP算法解决在处理大规模数据时传统算法失效的问题,但是由于其将每次执行之后的中间结果输出到磁盘,降低算法执行效率。为提高并行FP-Growth算法执行效率,提出一种基于Spark的SPFPG算法。该算法运用负载均衡思想对分组策略进行改进,综合考虑分区计算量和FP-Tree规模两个因素,保证每个组之间负载总和近似相等。在Spark上实现FP-Growth算法——SFPG算法的基础上,实现优化后的SPFPG算法。实验结果表明,SPFPG算法相比SFPG算法挖掘效率更高,且算法具有良好的扩展性。Association rule mining is one of the important research directions in data mining.The mining of frequent itemsets is the first and most important step in association rule mining.FP-Growth algorithm is widely used in frequent itemsets mining tasks because of its mining efficiency and its spatial complexity.Faced with big data, mining efficiency of FP-Growth algorithm becomes extremely low or even invalid.PFP algorithm, the parallel FP-Growth algorithm based on MapReduce framework implemented on Hadoop, solves the problem of failure of traditional algorithms when dealing with large-scale data.However, the efficiency of the algorithm is reduced because it output the intermediate results to disk after each execution.In order to improve the execution efficiency of parallel FP-Growth algorithm, a Spark-based SPFPG algorithm was proposed in the paper.The algorithm adopted the load balancing idea to improve the packet strategy. Considering the partition calculation and FP-Tree size, the sum of loads among each group was approximately equal. On the basis of SFPG, FP-Growth implemented on Spark, the optimized SPFPG algorithm was realized.The experimental results show that SPFPG algorithm is more efficient than SFPG algorithm and it has a good scalability.

关 键 词:大数据平台 关联规则 频繁项集 FP-GROWTH SPARK 

分 类 号:TP181[自动化与计算机技术—控制理论与控制工程]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象