利用布隆滤波二次拆分的数据倾斜处理算法  

Data skewness handling algorithm using Bloom filter based secondary partition

在线阅读下载全文

作  者:张强 张学文[2] ZHANG Qiang;ZHANG Xue-wen(School of Electrical Information,City College of Science and Technology,Chongqing University,Chongqing 402167,China;College of Mechanical Engineering,Beihua University,Jilin 132013,China)

机构地区:[1]重庆大学城市科技学院电气信息学院,重庆402167 [2]北华大学机械工程学院,吉林吉林132013

出  处:《计算机工程与设计》2021年第2期475-481,共7页Computer Engineering and Design

基  金:国家自然科学基金项目(60875064、61175102)。

摘  要:MapReduce进行大数据分布式计算时,数据集倾斜特性将导致子任务间完成时间差异明显,影响计算性能。提出基于布隆滤波二次拆分的处理算法。考虑硬件配置、计算资源等约束,提出基于布隆滤波的改进Hash分区函数,缩小映射操作的选择范围。在Reduce阶段设计学习自动机分区算法,使用均匀分布抽样获取任务类型与规模,以Reducer范围为约束条件搜索Mapper与Reducer的近似最优映射。与不考虑Reducer异构的常规算法相比,所提算法网络开销与负载变化率可降低2.4%和28%以上,负载均衡度则提升7.31%以上,表明所提算法具有更高的计算效率。In MapReduce,due to the skew feature of data sets,the completion time of each sub task is significantly different,which affects the calculation performance of the final big data processing.Hence,an algorithm with Bloom filter based secondary partition was proposed.Considering the constraints such as hardware configuration and computing resources,an improved Hash partition function based on Bloom filtering was proposed to narrow the selection range in Mapper stage.A partitioning algorithm based on learning automata was designed in the Reduce stage.The uniform distribution sampling algorithm was used to obtain the data task type and scale.The Reducer candidate range was used as the constraint to search the approximate optimal mapping relationship between Mapper and Reducer.The results of the example show that compared with the conventional algorithm which does not consider the heterogeneous characteristics of Reducer,the network overhead and load change rate are reduced by 2.4%and 28%respectively,and the load balance is improved by over 7.31%.Hence,the proposed algorithm has higher computational efficiency.

关 键 词:数据倾斜 布隆滤波 二次拆分 学习自动机 大数据处理 负载变化率 

分 类 号:TP391[自动化与计算机技术—计算机应用技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象