Improved algorithm for mining maximum frequent itemsets based on Spark  Cited by: 8


Authors: Jiao Runhai (焦润海)[1], Zhang Qian (张谦)[1], Chen Chao (陈超)[1]

Affiliation: [1] School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China

Source: Computer Engineering and Design (《计算机工程与设计》), 2017, No. 7, pp. 1839-1843 (5 pages)

Abstract: To mine frequent itemsets from large-scale, high-dimensional data, this work optimizes the time and space complexity and the parallelization strategy of traditional algorithms and implements an improved maximum frequent itemset mining algorithm on Spark. Combining the Spark distributed framework with the strengths of the DMFIA algorithm, it proposes two improvements: depth path search and length-first superset checking. A recursive depth path search generates the candidate set of maximum frequent itemsets in a single pass; the candidates are then sorted by length and checked against their supersets. This reduces both the size of the candidate set and the number of mining passes, which addresses the low efficiency of traditional maximum frequent itemset mining algorithms on large, high-dimensional data. Experimental results show that the algorithm is 2 to 4 times faster than comparable algorithms and scales well with dataset size.

Keywords: frequent pattern tree; distributed computing; data mining; association rules; maximum frequent itemsets

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
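
The length-first superset check named in the abstract can be illustrated with a minimal sketch. This is not the authors' Spark implementation, which the abstract does not detail. It assumes that every candidate has already passed the minimum-support check and that "length-first" means longest candidates are examined first; the function name `filter_maximal` is hypothetical.

```python
from typing import FrozenSet, Iterable, List


def filter_maximal(candidates: Iterable[Iterable[str]]) -> List[FrozenSet[str]]:
    """Keep only candidates that have no proper superset among the candidates.

    Assumption: all candidates are already known to be frequent.
    """
    # Deduplicate, then sort longest first so that any superset of a
    # candidate is examined before the candidate itself. The secondary
    # key only makes the output order deterministic.
    unique = {frozenset(c) for c in candidates}
    ordered = sorted(unique, key=lambda s: (-len(s), sorted(s)))

    maximal: List[FrozenSet[str]] = []
    for cand in ordered:
        # Drop the candidate if an already accepted itemset contains it.
        if not any(cand <= kept for kept in maximal):
            maximal.append(cand)
    return maximal


if __name__ == "__main__":
    example = [{"a", "b", "c"}, {"a", "b"}, {"b", "d"}, {"d"}]
    print(filter_maximal(example))
```

For the candidates {a, b, c}, {a, b}, {b, d} and {d}, the function keeps {a, b, c} and {b, d}. Sorting longest first means each candidate only has to be compared against the itemsets already accepted, never against shorter ones.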

 
