Parallel implementation of LDA algorithm based on Hadoop  Cited by: 3

Authors: ZHANG Zhao (张钊)[1,2,3], ZHANG Xinfeng (张新峰)[1,2,3], ZHENG Nan (郑楠)[1,2,3], GUI Mingjun (贵明俊)

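Affiliations: [1] College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing 100124; [2] Engineering Research Center of Digital Community, Ministry of Education, Beijing 100124; [3] Beijing Laboratory for Urban Mass Transit, Beijing 100124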

Source: Computer Engineering & Science (《计算机工程与科学》), 2016, No. 2, pp. 231-239 (9 pages)

Funding: Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (CIT&TCD201504018)

Abstract: With the rapid growth of the Internet, the volume of data to be processed keeps increasing, and traditional single-machine text clustering algorithms cannot meet the demands of massive data processing in Internet data mining. To address the inability of the traditional single-machine LDA algorithm to analyze large-scale corpora, we propose a method for building a parallel LDA topic model that uses Gibbs sampling on the MapReduce computing framework. We study the parallel implementation of the LDA topic model on MapReduce and evaluate the computational performance of the resulting program. Experiments comparing Hadoop-based parallel computation with single-machine computation show that, on large-scale corpora, the method considerably increases the running speed of the algorithm, and that the speedup also holds up well as the number of cluster nodes increases. Implementing LDA in parallel on the Hadoop platform is therefore feasible, and it solves the problem that a single machine cannot analyze the latent topic information in a large-scale corpus.
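The abstract does not say how the Gibbs sampler is split across map and reduce tasks, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes an approximate distributed scheme in which each "map" task runs one collapsed Gibbs sweep over its own shard of documents against a local copy of the global topic-word counts, and a "reduce" step sums the resulting count changes. It is plain Python that runs the map tasks one after another and uses no Hadoop API; every function name and the hyperparameters `alpha` and `beta` are hypothetical.

```python
import random
from collections import Counter

def count_topic_word(docs, z, num_topics, vocab_size):
    """Build topic-word counts n_kw from token-level topic assignments z."""
    n_kw = [[0] * vocab_size for _ in range(num_topics)]
    for doc, zs in zip(docs, z):
        for w, k in zip(doc, zs):
            n_kw[k][w] += 1
    return n_kw

def map_gibbs_sweep(docs, z, global_n_kw, alpha, beta, rng):
    """Hypothetical 'map' task: one collapsed Gibbs sweep over a shard.
    Works on a local copy of the global counts and returns the count delta."""
    K, V = len(global_n_kw), len(global_n_kw[0])
    n_kw = [row[:] for row in global_n_kw]  # local copy, stale w.r.t. other shards
    n_k = [sum(row) for row in n_kw]
    delta = Counter()
    for d, doc in enumerate(docs):
        n_dk = [0] * K
        for k in z[d]:
            n_dk[k] += 1
        for i, w in enumerate(doc):
            old = z[d][i]
            # Remove the current token from the counts before resampling.
            n_dk[old] -= 1
            n_kw[old][w] -= 1
            n_k[old] -= 1
            weights = [
                (n_dk[k] + alpha) * (n_kw[k][w] + beta) / (n_k[k] + V * beta)
                for k in range(K)
            ]
            new = rng.choices(range(K), weights=weights)[0]
            n_dk[new] += 1
            n_kw[new][w] += 1
            n_k[new] += 1
            z[d][i] = new
            if new != old:
                delta[(old, w)] -= 1
                delta[(new, w)] += 1
    return delta

def reduce_merge(global_n_kw, deltas):
    """Hypothetical 'reduce' step: add every shard's count delta to the global counts."""
    for delta in deltas:
        for (k, w), c in delta.items():
            global_n_kw[k][w] += c
    return global_n_kw

def run_lda(shards, num_topics, vocab_size, iterations, alpha=0.1, beta=0.01, seed=0):
    """Driver: each iteration is one simulated map phase followed by one reduce."""
    rng = random.Random(seed)
    zs = [[[rng.randrange(num_topics) for _ in doc] for doc in shard] for shard in shards]
    flat_docs = [doc for shard in shards for doc in shard]
    flat_z = [a for z in zs for a in z]
    global_n_kw = count_topic_word(flat_docs, flat_z, num_topics, vocab_size)
    for _ in range(iterations):
        # On a cluster these sweeps would run concurrently; here they run in sequence.
        deltas = [map_gibbs_sweep(s, z, global_n_kw, alpha, beta, rng)
                  for s, z in zip(shards, zs)]
        global_n_kw = reduce_merge(global_n_kw, deltas)
    return global_n_kw, zs
```

For example, `run_lda([[[0, 1, 2, 0], [1, 1, 3]], [[2, 3, 3], [0, 2]]], num_topics=2, vocab_size=4, iterations=5)` treats two lists of word-id documents as two shards. Because every shard samples against counts that ignore the other shards' updates within an iteration, the result approximates sequential Gibbs sampling rather than reproducing it; that trade-off is what allows the sweeps to run independently.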

Keywords: Hadoop; MapReduce; LDA topic model; Gibbs; Chinese word segmentation; parallel computing

Classification: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
