Data analysis method for parallel DHP based on Hadoop  Cited by: 4

Authors: Yang Yanxia[1], Feng Lin[1,2]

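Affiliations: [1] College of Computer Science, Sichuan Normal University, Chengdu 610101; [2] Sichuan Normal University Science Park Development Co., Ltd., Chengdu 610066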

Source: Journal of Computer Applications (《计算机应用》), 2016, No. 12, pp. 3280-3284, 3291 (6 pages)

Funding: National Key Technology R&D Program of China (2014BAH11F01; 2014BAH11F02); Sichuan Province Science and Technology Support Program (15GZ0079)

Abstract: Generating the frequent 2-itemset L2 from the candidate set C2 is a bottleneck of the Apriori algorithm for association rule mining. The Direct Hashing and Pruning (DHP) algorithm uses a generated hash table H2 to prune useless candidate itemsets from C2, which improves the efficiency of generating L2. However, the traditional DHP algorithm is serial and cannot effectively handle large-scale data. To address this problem, a parallel DHP algorithm, termed H_DHP, is proposed. First, the feasibility of the parallelization strategy for DHP is analyzed and proved theoretically. Then, based on the Hadoop platform, the generation of the hash table H2 and of the frequent itemsets L1 and L3 to Lk is implemented in parallel, and the association rules are generated with the help of the HBase database. Simulation results show that, compared with the traditional DHP algorithm, H_DHP performs better in processing time, in the size of the datasets it can handle, and in speedup and scalability.
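The pruning idea in the abstract can be illustrated with a minimal sketch. This is not the paper's H_DHP implementation, which the abstract does not give. It is plain Python that simulates data splits rather than calling any Hadoop API. The bucket function, the table size `NUM_BUCKETS`, the sample transactions, and all function names are assumptions made for illustration. The sketch relies on one property: item counts and H2 bucket counts are additive, so tables built independently on each split can be summed into the global table.

```python
from collections import Counter
from itertools import combinations

NUM_BUCKETS = 7  # hypothetical table size, chosen only for this example


def bucket(pair):
    """Hypothetical hash function mapping a sorted item pair to a bucket."""
    a, b = pair
    return (a * 10 + b) % NUM_BUCKETS


def local_counts(transactions):
    """'Map' step: count single items and H2 bucket hits for one data split."""
    items, h2 = Counter(), Counter()
    for t in transactions:
        items.update(t)
        for pair in combinations(sorted(t), 2):
            h2[bucket(pair)] += 1
    return items, h2


def merge(partials):
    """'Reduce' step: counts are additive, so partial tables are summed."""
    items, h2 = Counter(), Counter()
    for part_items, part_h2 in partials:
        items.update(part_items)
        h2.update(part_h2)
    return items, h2


def pruned_c2(items, h2, min_support):
    """Build C2 from L1, keeping only pairs whose H2 bucket reaches min_support."""
    l1 = sorted(i for i, c in items.items() if c >= min_support)
    return [p for p in combinations(l1, 2) if h2[bucket(p)] >= min_support]


def frequent_pairs(transactions, candidates, min_support):
    """Count the surviving candidates exactly to obtain L2."""
    counts = Counter()
    for t in transactions:
        present = set(t)
        for p in candidates:
            if p[0] in present and p[1] in present:
                counts[p] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


# Illustrative data only: four transactions over items 1..5.
transactions = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
splits = [transactions[:2], transactions[2:]]  # two simulated data splits
items, h2 = merge(local_counts(s) for s in splits)
c2 = pruned_c2(items, h2, min_support=2)
l2 = frequent_pairs(transactions, c2, min_support=2)
```

With this sample data and a minimum support of 2, the pairs (1, 2) and (1, 5) fall into buckets whose counts are below the threshold, so they are dropped from C2 before any exact counting. A bucket count is an upper bound on the support of every pair hashed into it, so this pruning can never discard a frequent pair.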

Keywords: Hadoop; Hash table; Apriori algorithm; Direct Hashing and Pruning (DHP) algorithm

Classification code: TP311.13 [Automation and Computer Technology: Computer Software and Theory]

 
