面向MapReduce的迭代式数据均衡分区策略  被引量:14

An Iterative Data Partitioning Strategy for MapReduce

在线阅读下载全文

作  者:张元鸣[1] 蒋建波 陆佳炜[1] 徐俊[1] 肖刚[1] ZHANG Yuan-Ming;JIANG Jian-Bo;LU Jia-Wei;XU Jun;XIAO Gang(College of Computer Science and Technology, Zhejiang University of Technology , Hangzhou 310023)

机构地区:[1]浙江工业大学计算机科学与技术学院

出  处:《计算机学报》2019年第8期1873-1885,共13页Chinese Journal of Computers

基  金:国家自然科学基金(61379017);浙江省公益技术项目(2017C31014);计算机体系结构国家重点实验室开发课题(CARCH201804)资助~~

摘  要:MapReduce是一种适用于大数据处理的重要并行计算框架.然而,由于难以提前全面获得中间数据的分布规律,默认的数据分区策略往往会造成Reducer端的数据倾斜,会直接影响MapReduce的整体性能.为了实现数据均衡分区,本文提出一种迭代式数据均衡分区策略,将每个Mapper节点要处理的数据块细分后以迭代方式循环处理,根据已迭代轮次的微分区分配结果决定当前迭代轮次的微分区分配方案,以不断调整历次迭代产生的数据倾斜,逐步实现数据均衡分区.给出了迭代式数据分区策略的分配时机、分配准则、分配评价模型和分配算法.基于公开的数据集,对迭代式数据均衡分区策略进行了详细测评,结果表明,该策略能够得到更均衡的数据分区结果,当数据集本身倾斜比较显著时,MapReduce整体性能比默认分区策略平均提高了11.1%和19.7%.MapReduce is an important parallel computing framework applied to big data.It can greatly improves data processing performance by running multiple tasks in parallel on a large number of cluster nodes.Intermediate data partitioning is a key problem that needs to be carefully handled in Shuffle phase,since the partitioning result will directly affect load balancing among Reducers.However,due to the difficulties of fully acquiring the distribution of intermediate data in advance,default data partitioning strategy will generally lead to data skew,which greatly limits the overall performance of MapReduce.This paper proposes an iterative data partitioning strategy for MapReduce.The data block processed in Map phase is further subdivided into fine-grained block(FGB)that will be iteratively processed.According to the partitioning results of previous micro-partitions(MPs),the partitioning plan of current MP is made.The data skew generated by previous MPs can be continuously eliminated,and balanced data partitioning can be finally achieved.In this strategy,data allocation begins after one FGB is completely processed by Mapper and all MPs contained in the FGB are generated.The condition to perform data allocation is that there are unallocated MPs in current iteration.The MPs with the same key will be allocated to the same Reducer in default.The MPs with larger factors are speculatively selected for data allocation in each iteration.The advantage of that is the MPs with larger factors contain more tuples and can be transmitted as early as possible.In addition,the Mapper also can work along with the data transmission in parallel.Vectors are defined to record profiling information of MP in every iteration and global profiling information of allocated MPs in all nodes.Furthermore,two algorithms are implemented to allocate the selected MPs to several Segments in current iteration based on existing partitioning results of previous MPs.These algorithms can make data balanced as well as possible after each iteration when future dat

关 键 词:MAPREDUCE 大数据 数据倾斜 迭代式数据分区 微分区 均衡分区 

分 类 号:TP311[自动化与计算机技术—计算机软件与理论]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象