基于Spark的并行频繁项集挖掘算法  被引量:3

Parallel algorithm for mining frequent item sets based on Spark

在线阅读下载全文

作  者:毛伊敏[1] 吴斌 许春冬[1] 张茂省 MAO Yimin;WU Bin;XU Chundong;ZHANG Maosheng(School of Information Engineering,Jiangxi University of Science and Technology,Ganzhou 341000,China;School of Human Settlements and Architectural Engineering,Xi’an Jiaotong University,Xi’an 710000,China)

机构地区:[1]江西理工大学信息工程学院,江西赣州341000 [2]西安交通大学人居环境与建筑工程学院,陕西西安710000

出  处:《计算机集成制造系统》2023年第4期1267-1283,共17页Computer Integrated Manufacturing Systems

基  金:国家自然科学基金资助项目(41562019,11864016);国家重点研发计划资助项目(2018YFC1504705)。

摘  要:针对大数据环境下基于Spark的频繁模式增长(FP-Growth)算法存在创建条件频繁模式树(FP-tree)时空效率低,节点间通信开销大,以及冗余搜索等问题,提出了基于Spark的并行频繁项集挖掘算法(PAFMFI-Spark)。首先,该算法提出非负矩阵分解策略(SNMF),通过提供支持度计数查询和分解储存支持度计数的矩阵,解决了创建条件FP-tree的时空效率低的问题;其次,提出基于遗传算法的分组策略(GS-GA),均衡分配频繁1项集至各节点,解决了节点间的通信开销大的问题;最后,提出高效缩减树结构策略(ERTSS),缩减FP-tree树结构,解决了冗余搜索的问题。实验结果验证了PAFMFI-Spark算法的可行性以及相较于其他挖掘算法的性能优势,所提算法能有效适应各种数据的频繁项集挖掘。Aiming at the problem of low space-time efficiency in creating conditional Frequent Pattern Tree(FP-tree),high communication overhead between nodes and redundant search in the Spark-based FP-Growth algorithm,a Parallel Algorithm For Mining Frequent Itemset based on Spark(PAFMFI-Spark)was proposed.A Strategy of Non-Negative Matrix Factorization(SNMF)was proposed,which provided the query of support counts and decomposed the matrix of support counts,thereby solving the problem of low space-time efficiency in creating conditional FP-tree.A Grouping Strategy based on Genetic Algorithm(GS-GA)was proposed,which evenly grouped frequent 1 item sets to solve the problem of high communication overhead between nodes.An Efficiently Reduce Tree Structure Strategy(ERTSS)was proposed,which reduced the structure of FP-tree to solve the redundant search problem.The feasibility of the PAFMFI-Spark algorithm was verified by the experiment,and its performance advantage was proved by comparing with other mining algorithms,which could effectively process frequent itemset mining of various data.

关 键 词:大数据 Spark框架 并行频繁项集挖掘 频繁模式增长算法 非负矩阵分解 

分 类 号:TP311[自动化与计算机技术—计算机软件与理论]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象