Authors: Jin Dan (靳丹) [1,2], Zhang Lei (张磊), Wang Hongjun (王洪军) [1,2], Wang Baohui (王宝会) [1,2]
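Affiliations: [1] Information and Communication Company, State Grid Gansu Electric Power Company, Gansu 730050; [2] School of Software, Beihang University, Beijing 100191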
Source: Network New Media Technology (《网络新媒体技术》), 2015, No. 5, pp. 33-38 (6 pages)
Abstract: ETL is the key step in building and running a data warehouse, and within ETL the most critical step is data cleaning and transformation. With data volumes growing explosively, the main challenges for data cleaning and transformation come from the complexity of the source data and from its sheer volume. For the volume problem, the MapReduce framework of the Hadoop ecosystem has become the de facto standard for massive data processing. This paper analyzes the complexity of data sources in data cleaning in a big data environment and, in response, proposes a simple, scalable, Hadoop-based data cleaning framework. Users of the framework need only a small amount of code to clean massive, complex data with MapReduce, and the complexity of MapReduce is transparent to developers. User behavior log data collected by an internet company with this framework serves as an example; in that example, the framework greatly improved the accuracy and efficiency of massive data cleaning compared with the previous solution. The framework can also be applied to cleaning massive unstructured data.
Keywords: data cleaning; Hadoop; MapReduce; big data
Classification number: TP311.13 [Automation and Computer Technology: Computer Software and Theory]