Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework (Cited by: 4)

Authors: ZHANG Yuan-ming [1], YU Jia-rui, JIANG Jian-bo, LU Jia-wei [1], XIAO Gang [1] (College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China)

Affiliation: [1] College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China

Source: Computer Science (《计算机科学》), 2021, No. 2, pp. 41-46 (6 pages)

Funding: Open Project of the State Key Laboratory of Computer Architecture (CARCH201804).

Abstract: MapReduce is an important parallel computing framework for big data processing. By executing many tasks in parallel across a large number of cluster nodes, it greatly improves data processing performance. However, intermediate data cannot be sent to the Reducer tasks until the Mapper task has completed, and the resulting transmission delay is a major bottleneck for the framework. To address this, the paper proposes an intermediate data transmission pipeline optimization mechanism for MapReduce. The mechanism decouples effective computation from intermediate data transmission and overlaps the stages in a pipelined fashion, which hides the data transmission overhead. The paper also gives the execution mechanism and implementation strategy of the pipeline, including pipeline partitioning, data subdivision, data merging, and data transmission granularity. The mechanism is evaluated on public datasets: when the volume of Shuffle data is large, overall performance improves by 60.2% over the default framework.
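The overlap described in the abstract can be illustrated with a small producer/consumer sketch. This is not the authors' implementation, which the record does not include. It only assumes that a mapper hands off intermediate data in chunks of a chosen granularity while a reducer-side thread merges each chunk as it arrives. The names `mapper`, `reducer`, `run_pipeline`, and `granularity` are hypothetical, and an in-process bounded queue stands in for the network.

```python
import queue
import threading
from collections import Counter

SENTINEL = None  # marks the end of the mapper's output stream


def mapper(lines, out_queue, granularity):
    """Count words, handing off a partial result every `granularity` lines."""
    chunk = Counter()
    for i, line in enumerate(lines, start=1):
        chunk.update(line.split())       # effective computation
        if i % granularity == 0:
            out_queue.put(dict(chunk))   # ship the chunk without waiting for the end
            chunk = Counter()
    if chunk:
        out_queue.put(dict(chunk))       # flush the final, possibly smaller chunk
    out_queue.put(SENTINEL)


def reducer(in_queue, result):
    """Merge chunks as they arrive, while the mapper is still running."""
    while True:
        chunk = in_queue.get()
        if chunk is SENTINEL:
            break
        result.update(chunk)             # incremental merge: adds counts per key


def run_pipeline(lines, granularity=2):
    channel = queue.Queue(maxsize=4)     # bounded buffer standing in for the network
    result = Counter()
    consumer = threading.Thread(target=reducer, args=(channel, result))
    consumer.start()
    mapper(lines, channel, granularity)  # producer runs in the calling thread
    consumer.join()
    return dict(result)


if __name__ == "__main__":
    print(run_pipeline(["a b a", "b c", "a"], granularity=2))
```

In this sketch the transmission granularity is a trade-off. Small chunks start the downstream merge sooner but add per-chunk overhead, while large chunks approach the default behaviour of sending everything only after the map phase ends.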

Keywords: MapReduce framework; intermediate data transmission; transmission delay; pipeline; spill file; merging

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
