Parallel Data Mining Method Based on the Hadoop Cloud Platform (cited by: 38)

Parallel Approach in Data Mining Based on Hadoop Cloud Platform

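Affiliations: [1] Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; [2] University of Chinese Academy of Sciences, Beijing 100039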

Source: Journal of System Simulation (《系统仿真学报》), 2013, No. 5, pp. 936-944 (9 pages)

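Funding: National Natural Science Foundation of China (61035003; 61072085; 61202212; 60933004); National 973 Program (2013CB329502); National 863 High-Tech Research and Development Program (2012AA011003); National Science and Technology Support Program (2012BA107B02)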

Abstract: Industry has begun to use cloud platforms to process massive high-dimensional data, simulating a variety of heterogeneous systems as a single system. Data mining in a Hadoop environment, however, runs into several problems: the global nature of data models, random write operations on HDFS files, and short data lifetimes. To solve these problems and achieve efficient large-scale data mining on Hadoop, an efficient data mining framework on Hadoop is proposed. It uses a database to simulate a linked-list structure, manages the mined knowledge, and provides distributed computation methods for tree structures and graph models. On this basis, a statistical algorithm, Yscore binning, is implemented, together with tree-building algorithms for decision trees and KD-trees, and the Vega cloud is used to simulate a Hadoop cluster. The experimental data show that the framework and algorithms are practical and feasible, and that they may extend to fields beyond data mining.

Keywords: parallel data mining; decision tree algorithm; KD-tree algorithm; JPA; cloud computing

Classification: TP391.9 [Automation and Computer Technology: Computer Application Technology]
