Optimization of the Parallel Big Data Cleaning Process Based on Task Merging (cited by: 48)

The Optimization of the Big Data Cleaning Based on Task Merging

Authors: Yang Donghua[1,2], Li Ningning[1], Wang Hongzhi[1], Li Jianzhong[1], Gao Hong[1]

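Affiliations: [1] School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001; [2] Academy of Fundamental and Interdisciplinary Sciences, Harbin Institute of Technology, Harbin 150001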

Source: Chinese Journal of Computers (《计算机学报》), 2016, No. 1, pp. 97-108 (12 pages)

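Funding: Supported by the National Basic Research Program of China (973 Program) (2012CB316200); the National Natural Science Foundation of China (61472099; 60933001; 61272046); the National High Technology Research and Development Program of China (863 Program) (2012AA011004); the China Postdoctoral Science Foundation (20090450126; 201003447); the China Postdoctoral Science Foundation Special Funded Project (2013T60372); the Doctoral Program Foundation of the Ministry of Education (20102302120054); and the Natural Science Foundation of Heilongjiang Province (F201317).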

Abstract: Data quality problems can severely undermine big data applications, so big data with such problems must be cleaned. The MapReduce programming framework can use parallelism to achieve highly scalable big data cleaning. However, for lack of effective design, MapReduce-based data cleaning contains redundant computation, which degrades performance. This paper therefore aims to optimize the parallel data cleaning process to improve its efficiency. The authors observe that some data cleaning tasks often run on the same input file or use the same computed results. Based on this observation, the paper proposes a new optimization technique: optimization based on task merging. Redundant computations, and simple computations that use the same input file, are merged; this reduces the number of MapReduce rounds and hence the system's running time, achieving the optimization goal. The paper optimizes several complex modules of the data cleaning process, specifically the entity recognition module, the inconsistent data repair module, and the missing value filling module. Experimental results show that the proposed strategy effectively improves the efficiency of data cleaning.
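The idea of merging tasks that read the same input file can be illustrated with a small sketch. This is not the authors' implementation: it is a single-process Python stand-in for one MapReduce round, and every name in it (`run_mapreduce`, `merge_tasks`, the two example tasks, the sample records) is a hypothetical chosen for illustration. It assumes that two independent cleaning tasks each need a full scan of the same records, and shows how tagging keys with a task name lets both run in one round instead of two.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy single-process stand-in for one MapReduce round."""
    groups = defaultdict(list)
    for record in records:  # one full scan of the input
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical task A: count missing values per attribute.
def missing_mapper(record):
    for attr, value in record.items():
        if value is None:
            yield attr, 1


# Hypothetical task B: count records per city.
def city_mapper(record):
    if record["city"] is not None:
        yield record["city"], 1


def sum_reducer(key, values):
    return sum(values)


def merge_tasks(tasks):
    """Combine several (mapper, reducer) pairs into one job by tagging keys."""
    def merged_mapper(record):
        for name, (mapper, _) in tasks.items():
            for key, value in mapper(record):
                yield (name, key), value  # the tag keeps task outputs apart

    def merged_reducer(tagged_key, values):
        name, key = tagged_key
        return tasks[name][1](key, values)  # dispatch to the original reducer

    return merged_mapper, merged_reducer


records = [
    {"name": "Ann", "city": "Harbin", "phone": None},
    {"name": "Bob", "city": None, "phone": None},
    {"name": None, "city": "Harbin", "phone": "555"},
]

tasks = {
    "missing": (missing_mapper, sum_reducer),
    "city": (city_mapper, sum_reducer),
}
merged_mapper, merged_reducer = merge_tasks(tasks)

# One round over the input instead of one round per task.
merged = run_mapreduce(records, merged_mapper, merged_reducer)
```

In this sketch the merged job produces the same per-task results as running each task separately, but the records are scanned once. On a real cluster the saving would come from fewer job launches and fewer reads of the input file, which the toy version does not measure.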

Keywords: big data; multi-task optimization; massive data; data cleaning; Hadoop; MapReduce

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
