基于哈希存储与事务加权的并行Apriori改进算法  被引量:9

Improved Parallel Apriori Algorithm Based on Hash Storageand Transaction Weighting

在线阅读下载全文

作  者:李洁 朱洪亮[1,2] 陈玉玲 辛阳[1,2] LI Jie;ZHU Hongliang;CHEN Yuling;XIN Yang(School of Cyberspace Security,Beijing University of Posts and Telecommunications,Beijing 100876,China;Guizhou Provincial Key Laboratory of Public Big Data,Guizhou University,Guiyang 550025,China)

机构地区:[1]北京邮电大学网络空间安全学院,北京100876 [2]贵州大学贵州省公共大数据重点实验室,贵阳550025

出  处:《计算机工程》2020年第11期109-116,共8页Computer Engineering

基  金:国家重点研发计划(2017YFB0802300);贵州省科技重大专项(20183001);贵州省公共大数据重点实验室开放课题(2018BDKFJJ008,2018BDKFJJ020)。

摘  要:Apriori算法能够挖掘事物之间的关联关系,但传统Apriori算法每计算一次候选集的支持度,都需要遍历原始事务数据库,多次扫描数据库导致其效率较低。为此,提出一种基于哈希存储与事务加权的改进算法。通过哈希存储的去重特性对事务进行去重,以减少冗余计算。将项目与项集的映射存储到哈希结构中,避免计算候选集的支持度时多次扫描事务数据库。同时开启多个线程,并行计算候选集的支持度,从而提高Apriori算法的运行效率。在开源数据集上的实验结果表明,当数据集中事务条数以及重复事务数越多时,该算法相较于传统Apriori算法的性能提升越明显,其运行时间与FP-Growth算法相近但避免了FP-Growth算法内存占用过大的问题。The Apriori algorithm can mine the association relationships between things,but the traditional Apriori algorithm needs to traverse the original transaction database every time the support of the candidate set is calculated,which reduces the efficiency of the algorithm.To address the problem,this paper proposes an improved algorithm based on hash storage and transaction weighting.The algorithm uses the deduplication feature of hash storage to deduplicate the transactions to reduce redundant calculations.At the same time,the mapping between the transaction set and the itemset is stored in the hash structure to avoid scanning the transaction database for multiple times during the calculation of the support of the candidate set.In addition,the support of the candidate set is calculated in parallel using multiple threads to improve the efficiency of the Apriori algorithm.Experimental results on open-source datasets show that the performance improvement of the proposed algorithm over the traditional Apriori algorithm increases with the number of transactions and repeated transactions in the dataset.Its running time is similar to that of the FP-Growth algorithm while the excessive memory consumption is avoided.

关 键 词:关联规则 频繁项集 哈希存储 事务加权 并行计算 

分 类 号:TP391[自动化与计算机技术—计算机应用技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象