Distributed frequent itemset mining algorithm based on item encoding  (cited by: 4)

Novel distributed itemset mining algorithm based on item encoding


Authors: Zheng Jingyi; Deng Xiaoheng [1] (College of Software, Central South University, Changsha 410075, China)

Affiliation: [1] College of Software, Central South University, Changsha 410075, China

Source: Application Research of Computers (《计算机应用研究》), 2019, No. 4, pp. 1059-1063, 1067 (6 pages)

Funding: Central South University Graduate Research and Innovation Project (2017zzts612)

Abstract: Apriori is one of the most widely used algorithms for frequent itemset mining. However, it scans the complete dataset in every iteration, which severely hurts efficiency and makes the algorithm hard to parallelize. The problem grows more serious as datasets keep getting larger. To address it, this paper proposes IEBDA, a parallel Apriori method based on item encoding and the Spark computing framework. Item encoding preserves the complete itemset information, so frequent itemsets can be mined without repeatedly scanning the complete dataset, and Spark broadcast variables enable parallel processing. In performance comparisons with other distributed Apriori algorithms on datasets of different sizes, IEBDA shows a clear speedup after the first iteration. The results show that the algorithm can improve the efficiency of multi-iteration frequent itemset mining in big data environments.
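The abstract does not spell out how IEBDA encodes items, so the following is only a minimal single-machine sketch of the general idea, under a loudly stated assumption: each item is encoded as an integer bitmap of the transaction IDs that contain it (a common vertical-bitmap technique, not necessarily the paper's encoding). The function names `encode_items`, `support`, and `frequent_itemsets` are hypothetical. The point it illustrates is that, once the encoding is built in one pass, every later Apriori iteration can count support from the encodings alone instead of rescanning the transactions.

```python
from itertools import combinations


def encode_items(transactions):
    """One pass over the data: map each item to an integer bitmap.

    Bit i of an item's bitmap is set when transaction i contains the item.
    (Assumed encoding for illustration; the abstract does not define one.)
    """
    codes = {}
    for tid, items in enumerate(transactions):
        for item in items:
            codes[item] = codes.get(item, 0) | (1 << tid)
    return codes


def support(itemset, codes):
    """Support of an itemset = number of set bits in the AND of its item bitmaps."""
    items = iter(itemset)
    bits = codes[next(items)]
    for item in items:
        bits &= codes[item]
    return bin(bits).count("1")


def frequent_itemsets(transactions, min_support):
    """Apriori-style level-wise search that reads the raw transactions only once."""
    codes = encode_items(transactions)  # the only scan of the dataset
    level = [frozenset([i]) for i in sorted(codes)
             if support([i], codes) >= min_support]
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset, codes)
        k += 1
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step (Apriori property), then count support from the encodings.
        level = [c for c in candidates
                 if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
                 and support(c, codes) >= min_support]
    return result
```

In a Spark setting, a table like `codes` is the kind of read-only structure one would plausibly ship to workers as a broadcast variable so that candidate itemsets can be counted in parallel; that mapping onto Spark is an assumption here, not a description of the paper's implementation.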

Keywords: frequent itemset mining; Apriori algorithm; big data; distributed computing

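Classification: TP391 [Automation and Computer Technology - Computer Application Technology]; TP301.6 [Automation and Computer Technology - Computer Science and Technology]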

 
