检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:张素琪[1] 孙云飞 武君艳 顾军华 Zhang Suqi;Sun Yunfei;Wu Junyan;Gu Junhua(School of Information Engineering, Tianjin University of Commerce, Tianjin 300134, China;School of Artificial Intelligence and Data Science Institute, Hebei University of Technology, Tianjin 300401, China;Hebei Key Laboratory of Big Data Computing, Tianjin 300401, China)
机构地区:[1]天津商业大学信息工程学院,天津300134 [2]河北工业大学人工智能与数据科学学院,天津300401 [3]河北省大数据计算重点实验室,天津300401
出 处:《计算机应用与软件》2019年第2期24-28,143,共6页Computer Applications and Software
基 金:河北省科技计划项目(17210305D);天津市科技计划项目(15ZXHLGX00130;16ZXHLSF0023)
摘 要:关联规则挖掘是数据挖掘领域的重要研究方向之一。频繁项集的挖掘是关联规则挖掘的第一步,也是最重要的步骤。FP-Growth(Frequent Pattern-Growth)算法因其挖掘效率以及空间复杂度方面的优势被广泛应用于频繁项集挖掘任务中。面对海量数据,FP-Growth算法挖掘效率变得极低甚至失效。在Hadoop大数据平台上实现的基于MapReduce框架的并行FP-Growth算法——PFP算法解决在处理大规模数据时传统算法失效的问题,但是由于其将每次执行之后的中间结果输出到磁盘,降低算法执行效率。为提高并行FP-Growth算法执行效率,提出一种基于Spark的SPFPG算法。该算法运用负载均衡思想对分组策略进行改进,综合考虑分区计算量和FP-Tree规模两个因素,保证每个组之间负载总和近似相等。在Spark上实现FP-Growth算法——SFPG算法的基础上,实现优化后的SPFPG算法。实验结果表明,SPFPG算法相比SFPG算法挖掘效率更高,且算法具有良好的扩展性。Association rule mining is one of the important research directions in data mining.The mining of frequent itemsets is the first and most important step in association rule mining.FP-Growth algorithm is widely used in frequent itemsets mining tasks because of its mining efficiency and its spatial complexity.Faced with big data, mining efficiency of FP-Growth algorithm becomes extremely low or even invalid.PFP algorithm, the parallel FP-Growth algorithm based on MapReduce framework implemented on Hadoop, solves the problem of failure of traditional algorithms when dealing with large-scale data.However, the efficiency of the algorithm is reduced because it output the intermediate results to disk after each execution.In order to improve the execution efficiency of parallel FP-Growth algorithm, a Spark-based SPFPG algorithm was proposed in the paper.The algorithm adopted the load balancing idea to improve the packet strategy. Considering the partition calculation and FP-Tree size, the sum of loads among each group was approximately equal. On the basis of SFPG, FP-Growth implemented on Spark, the optimized SPFPG algorithm was realized.The experimental results show that SPFPG algorithm is more efficient than SFPG algorithm and it has a good scalability.
关 键 词:大数据平台 关联规则 频繁项集 FP-GROWTH SPARK
分 类 号:TP181[自动化与计算机技术—控制理论与控制工程]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.229