Design and Application of a Hadoop-Based Big Data Cleaning Framework  Cited by: 6


Authors: Jin Dan (靳丹)[1,2], Zhang Lei (张磊), Wang Hongjun (王洪军)[1,2], Wang Baohui (王宝会)[1,2]

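Affiliations: [1] Information and Communication Company, State Grid Gansu Electric Power Company, Gansu 730050; [2] School of Software, Beihang University, Beijing 100191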

Source: Network New Media Technology (《网络新媒体技术》), 2015, No. 5, pp. 33-38 (6 pages)

Abstract: ETL is the key step in building and running a data warehouse, and data cleaning and transformation is the most critical step within ETL. With data volumes growing explosively, the main challenges for cleaning and transformation are the complexity of the source data and its sheer volume. For the volume problem, the MapReduce framework of the Hadoop ecosystem has become the de facto standard for processing massive data. This paper analyzes the complexity of data sources in data cleaning work in a big data environment and, in response, proposes a simple, extensible Hadoop-based data cleaning framework. Users of the framework need only a small amount of code to clean massive, complex data with MapReduce, and the complexity of MapReduce is transparent to developers. User behavior log data collected by an internet company with this framework serves as the example; in this example, the framework greatly improves the accuracy and efficiency of massive data cleaning compared with the previous solution. The framework can also be applied to cleaning massive unstructured data.
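The abstract does not show the framework's interface, so the following is only a minimal sketch of the general idea it describes: the user supplies a few small cleaning rules, and a generic driver applies them to each record the way a map-only job would. Everything here is an assumption made for illustration, including the names `Rule`, `parse_line`, `drop_missing_user`, `normalize_url`, and `clean`, and the tab-separated log layout (user ID, timestamp, URL). It is not the paper's actual API.

```python
from typing import Callable, Iterable, Iterator, Optional, Sequence

# Hypothetical rule interface: take one parsed record, return the cleaned
# record, or None to drop it. Not taken from the paper.
Rule = Callable[[dict], Optional[dict]]


def parse_line(line: str) -> Optional[dict]:
    """Parse one log line; the 3-field tab-separated layout is assumed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None  # malformed line, discard
    return {"user_id": parts[0], "timestamp": parts[1], "url": parts[2]}


def drop_missing_user(rec: dict) -> Optional[dict]:
    """Example rule: discard records with an empty user ID."""
    return rec if rec["user_id"].strip() else None


def normalize_url(rec: dict) -> Optional[dict]:
    """Example rule: trim whitespace and lower-case the URL."""
    return {**rec, "url": rec["url"].strip().lower()}


def clean(lines: Iterable[str], rules: Sequence[Rule]) -> Iterator[str]:
    """Generic map-style driver: apply each rule in order to every record."""
    for line in lines:
        rec = parse_line(line)
        for rule in rules:
            if rec is None:
                break  # an earlier step dropped the record
            rec = rule(rec)
        if rec is not None:
            yield "\t".join((rec["user_id"], rec["timestamp"], rec["url"]))
```

With a driver of this shape, the user-written part is limited to the rule functions. Under Hadoop Streaming, for instance, a mapper script could feed `sys.stdin` to `clean` and print each yielded line.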

Keywords: data cleaning; Hadoop; MapReduce; big data

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]

 
