Efficient Frequent Patterns Mining Algorithm Based on MapReduce Model  Cited by: 9

Authors: ZHU Kun[1,2], HUANG Ruizhang[1,2], ZHANG Nana[1,2]

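Affiliations: [1] College of Computer Science and Technology, Guizhou University, Guiyang 550025; [2] Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025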

Source: Computer Science (《计算机科学》), 2017, Issue 7, pp. 31-37 (7 pages)

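Funding: Supported by the National Natural Science Foundation of China (61462011; 61202089), the Specialized Research Fund for the Doctoral Program of Higher Education (20125201120006), and the Guizhou University Research Project for Introduced Talent (2011015)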

Abstract: With the rapid development of Internet technology and the rapid growth of its user base, many Internet service companies must process terabytes of data or more every day. In the big data era, extracting useful information has become an important problem. Data mining algorithms are widely used in many fields. Frequent itemset mining is one of the most common and important applications of data mining, and Apriori is the most typical algorithm for mining frequent itemsets from a large dataset. However, when the dataset is large or only a single host is used, memory is consumed quickly and computation time rises sharply, so the algorithm performs poorly. Parallel and distributed computing based on MapReduce has been proposed to address this. This paper proposes an improved algorithm, MMRA (Matrix MapReduce Algorithm), which converts blocked data into matrices to mine all frequent k-itemsets, and compares it with two existing algorithms (the one-phase algorithm and the k-phase algorithm). Hadoop MapReduce is used as the experimental platform, since parallel and distributed computing offer a potential solution for processing large datasets. The experimental results show that the improved algorithm outperforms the other two algorithms.
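The abstract gives only the general idea of the method: split the data into blocks, encode each block as a matrix, and count itemset support in parallel. The following is a minimal single-process Python sketch of that idea, not the authors' MMRA implementation. All function names (to_matrix, map_block, reduce_counts, next_candidates, mine) are hypothetical, the map and reduce steps are simulated with plain function calls rather than Hadoop jobs, and the level-by-level loop with Apriori pruning is an assumption made for illustration.

```python
from collections import Counter
from itertools import combinations


def to_matrix(block, items):
    """Encode one data block as a 0/1 matrix: a row per transaction, a column per item."""
    return [[1 if item in transaction else 0 for item in items] for transaction in block]


def map_block(matrix, candidates, col):
    """Simulated map step: count, within one block, the rows that contain each candidate."""
    counts = Counter()
    for row in matrix:
        for cand in candidates:
            if all(row[col[item]] for item in cand):
                counts[cand] += 1
    return counts


def reduce_counts(partials, min_support):
    """Simulated reduce step: sum per-block counts and keep itemsets meeting min_support."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return {cand: n for cand, n in total.items() if n >= min_support}


def next_candidates(frequent, k):
    """Apriori pruning: keep k-itemsets whose (k-1)-subsets are all frequent."""
    items = sorted({item for itemset in frequent for item in itemset})
    return [cand for cand in combinations(items, k)
            if all(sub in frequent for sub in combinations(cand, k - 1))]


def mine(blocks, min_support):
    """Return {itemset: support} for every frequent itemset across all blocks."""
    items = sorted({item for block in blocks for t in block for item in t})
    col = {item: j for j, item in enumerate(items)}
    matrices = [to_matrix(block, items) for block in blocks]  # built once, reused for every k
    candidates = [(item,) for item in items]
    result, k = {}, 1
    while candidates:
        partials = [map_block(m, candidates, col) for m in matrices]
        frequent = reduce_counts(partials, min_support)
        result.update(frequent)
        k += 1
        candidates = next_candidates(frequent, k)
    return result


# Toy data: five transactions split into two blocks (illustrative only).
blocks = [
    [{"a", "b", "c"}, {"a", "b"}],
    [{"a", "c"}, {"b", "c"}, {"a", "b", "c"}],
]
result = mine(blocks, min_support=3)
```

In this toy run each single item and each pair meets the support threshold of 3, while the triple (a, b, c) appears in only two transactions and is dropped. Building the matrices once and reusing them for every k is a design choice of this sketch; the abstract does not say how MMRA schedules its passes.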

Keywords: Hadoop; MapReduce; distributed computing; data mining; frequent itemset mining; Apriori algorithm

Classification: TP399 [Automation and Computer Technology: Computer Application Technology]

 
