Optimization of Apriori Parallel Algorithm Based on Spark

Cited by: 12

Authors: WANG Qing [1], TAN Liang [1,2], YANG Xianhua [3]

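Affiliations: [1] College of Computer Science, Sichuan Normal University, Chengdu 610101, Sichuan, China; [2] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; [3] Sichuan Institute of Computer Sciences, Chengdu 610041, Sichuan, China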

Source: Journal of Zhengzhou University: Natural Science Edition, 2016, No. 4, pp. 60-64 (5 pages)

Funding: National Natural Science Foundation of China (61373162); Sichuan Province Science and Technology Support Program (2014GZ007)

Abstract: To address the processing-speed and computing-resource bottlenecks of the traditional Apriori algorithm, and the problems that the Map-Reduce framework on Hadoop cannot handle node failures, does not support iterative computation well, and cannot compute in memory, a parallel association rule optimization algorithm based on Spark is proposed. The algorithm needs only two scans of the transaction database and makes full use of Spark's in-memory RDDs to store itemsets. Compared with the traditional Apriori algorithm, it greatly reduces the number of scans of the transaction database. Compared with Apriori on Hadoop, it simplifies computation, supports iteration, and reduces I/O overhead by caching intermediate results in memory. Experimental results show that the algorithm improves the efficiency of association rule mining on large-scale data.
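The abstract does not spell out the algorithm's steps, so the following is only a minimal single-machine sketch of the general idea it describes: read the transaction database twice, keep the pruned transactions in memory, and run every later Apriori iteration against that in-memory copy. It uses plain Python rather than Spark. The function name `frequent_itemsets`, the `read_transactions` callable, and the absolute `min_support` count are all assumptions made for illustration and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(read_transactions, min_support):
    """Level-wise Apriori that reads the data source only twice.

    read_transactions: hypothetical zero-argument callable; each call
        stands for one full scan of the transaction database and returns
        an iterable of transactions (each an iterable of items).
    min_support: minimum absolute support count (an assumption; the
        paper may use a relative threshold).
    """
    # Scan 1: count single items to find the frequent 1-itemsets.
    item_counts = Counter()
    for transaction in read_transactions():
        item_counts.update(set(transaction))
    keep = {item for item, n in item_counts.items() if n >= min_support}
    result = {frozenset([item]): item_counts[item] for item in keep}

    # Scan 2: drop infrequent items and cache what is left in memory.
    # All later iterations use this cache instead of the database.
    cache = [frozenset(t) & keep for t in read_transactions()]
    cache = [t for t in cache if len(t) >= 2]

    level = set(result)
    k = 2
    while level:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Count candidates against the in-memory cache only.
        counts = Counter()
        for t in cache:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c for c, n in counts.items() if n >= min_support}
        result.update({c: counts[c] for c in level})
        k += 1
    return result
```

In a Spark setting, the list named `cache` here would plausibly correspond to an RDD persisted with `cache()`, and the counting loop to a distributed map and reduce over it. That mapping is an interpretation of the abstract, not a description of the authors' implementation.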

Keywords: parallelization; data mining; association rules

Classification code: TP301.6 [Automation and Computer Technology: Computer System Architecture]

