OPTIMIZATION AND IMPLEMENTATION OF PARALLEL FP-GROWTH ALGORITHM BASED ON SPARK  Cited by: 8

Authors: Lu Ke (陆可)[1], Gui Wei (桂伟)[1], Jiang Yuyan (江雨燕)[1], Du Pingping (杜萍萍)[1]

Affiliation: [1] School of Management Science and Engineering, Anhui University of Technology, Ma'anshan 243000, Anhui, China

Source: Computer Applications and Software (《计算机应用与软件》), 2017, No. 9, pp. 273-278 (6 pages)

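Funding: National Natural Science Foundation of China (71371013); Anhui University of Technology Young Teachers' Research Fund (QZ201420); Natural Science Foundation of the Anhui Provincial Department of Education (KJ2016A087)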

Abstract: Frequent pattern mining is an important problem in pattern recognition and has long attracted wide attention from researchers. The FP-Growth algorithm is widely used for frequent pattern mining because it is fast and efficient. However, because the algorithm depends on running in local memory, it is difficult to apply to large-scale data. To address this problem, this paper studies frequent pattern mining on large-scale datasets. Based on the Spark framework, the FP-Growth algorithm is improved by optimizing the support counting and grouping steps, and distributed computation with dynamic allocation of computing resources is implemented. Intermediate results produced during computation are kept in memory, which reduces data I/O overhead and improves the running efficiency of the algorithm. Experimental results show that the optimized algorithm outperforms the traditional FP-Growth algorithm on large-scale data.

Keywords: frequent pattern mining; FP-Growth algorithm; distributed computing; Spark framework

Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]

 
