基于Spark的并行频繁项集挖掘算法被引量：3

Parallel algorithm for mining frequent item sets based on Spark

作　　者：毛伊敏[1] 吴斌许春冬[1] 张茂省 MAO Yimin;WU Bin;XU Chundong;ZHANG Maosheng(School of Information Engineering,Jiangxi University of Science and Technology,Ganzhou 341000,China;School of Human Settlements and Architectural Engineering,Xi’an Jiaotong University,Xi’an 710000,China)

机构地区：[1]江西理工大学信息工程学院,江西赣州341000 [2]西安交通大学人居环境与建筑工程学院,陕西西安710000

出　　处：《计算机集成制造系统》2023年第4期1267-1283,共17页Computer Integrated Manufacturing Systems

基　　金：国家自然科学基金资助项目(41562019,11864016);国家重点研发计划资助项目(2018YFC1504705)。

摘　　要：针对大数据环境下基于Spark的频繁模式增长(FP-Growth)算法存在创建条件频繁模式树(FP-tree)时空效率低,节点间通信开销大,以及冗余搜索等问题,提出了基于Spark的并行频繁项集挖掘算法(PAFMFI-Spark)。首先,该算法提出非负矩阵分解策略(SNMF),通过提供支持度计数查询和分解储存支持度计数的矩阵,解决了创建条件FP-tree的时空效率低的问题;其次,提出基于遗传算法的分组策略(GS-GA),均衡分配频繁1项集至各节点,解决了节点间的通信开销大的问题;最后,提出高效缩减树结构策略(ERTSS),缩减FP-tree树结构,解决了冗余搜索的问题。实验结果验证了PAFMFI-Spark算法的可行性以及相较于其他挖掘算法的性能优势,所提算法能有效适应各种数据的频繁项集挖掘。Aiming at the problem of low space-time efficiency in creating conditional Frequent Pattern Tree(FP-tree),high communication overhead between nodes and redundant search in the Spark-based FP-Growth algorithm,a Parallel Algorithm For Mining Frequent Itemset based on Spark(PAFMFI-Spark)was proposed.A Strategy of Non-Negative Matrix Factorization(SNMF)was proposed,which provided the query of support counts and decomposed the matrix of support counts,thereby solving the problem of low space-time efficiency in creating conditional FP-tree.A Grouping Strategy based on Genetic Algorithm(GS-GA)was proposed,which evenly grouped frequent 1 item sets to solve the problem of high communication overhead between nodes.An Efficiently Reduce Tree Structure Strategy(ERTSS)was proposed,which reduced the structure of FP-tree to solve the redundant search problem.The feasibility of the PAFMFI-Spark algorithm was verified by the experiment,and its performance advantage was proved by comparing with other mining algorithms,which could effectively process frequent itemset mining of various data.

关键词：大数据 Spark框架并行频繁项集挖掘频繁模式增长算法非负矩阵分解

分类号：TP311[自动化与计算机技术—计算机软件与理论]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

基于Spark的并行频繁项集挖掘算法被引量：3

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

基于Spark的并行频繁项集挖掘算法 被引量：3

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索

基于Spark的并行频繁项集挖掘算法被引量：3