Improved algorithm for mining maximum frequent itemsets based on Spark

Original title: 基于Spark改进的最大频繁项集挖掘算法

Cited by: 8

Source: Computer Engineering and Design (《计算机工程与设计》), 2017, No. 7, pp. 1839-1843 (5 pages)

Abstract: To address frequent itemset mining over large-scale, high-dimensional data, the time and space complexity and the parallelization strategy of traditional algorithms are optimized, and an improved maximum frequent itemset mining algorithm is implemented on Spark. The algorithm combines the Spark distributed framework with the strengths of the DMFIA algorithm and introduces two improvements: depth path search and length-first superset checking. A recursive depth-path search generates the candidate set of maximum frequent itemsets in a single pass. The candidate itemsets are then sorted by length and checked against their supersets. Together these steps reduce both the size of the candidate set and the number of mining passes, which addresses the low efficiency of traditional maximum frequent itemset mining algorithms when the data volume is large and the dimensionality is high. Experimental results show that the algorithm runs 2-4 times faster than comparable algorithms and scales well with dataset size.
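The abstract does not give the details of the length-first superset check, so the following is only a minimal single-machine sketch of the general idea, not the paper's Spark implementation. It assumes the candidates have already passed the support threshold; the function name `prune_to_maximal` and the sample data are hypothetical.

```python
from typing import FrozenSet, Iterable, List


def prune_to_maximal(candidates: Iterable[Iterable[str]]) -> List[FrozenSet[str]]:
    """Keep only candidate itemsets that are not contained in another candidate.

    Assumption: every candidate is already known to be frequent; the support
    count is not recomputed here.
    """
    # Remove exact duplicates, then order longest first.
    # Ties are broken by the sorted items so the output order is deterministic.
    unique = {frozenset(c) for c in candidates}
    ordered = sorted(unique, key=lambda s: (-len(s), sorted(s)))

    maximal: List[FrozenSet[str]] = []
    for itemset in ordered:
        # A proper superset is strictly longer, so it was visited earlier and is
        # either in `maximal` or is itself covered by something in `maximal`.
        if not any(itemset <= kept for kept in maximal):
            maximal.append(itemset)
    return maximal


if __name__ == "__main__":
    sample = [["a", "b"], ["a", "b", "c"], ["c", "d"], ["d"], ["a", "b", "c"]]
    for itemset in prune_to_maximal(sample):
        print(sorted(itemset))
```

Sorting by length first means each itemset only has to be compared against itemsets already accepted as maximal, rather than against every other candidate.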

Keywords: frequent pattern tree; distributed computing; data mining; association rules; maximum frequent itemsets

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
