Virtual partitioning algorithm of skewed data based on Spark


Author: LI Jun-li [1]

Affiliation: [1] Department of Computer Science and Technology, Jinzhong University, Jinzhong, Shanxi 030619, China

Source: Computer Engineering and Design, 2021, No. 8, pp. 2271-2276 (6 pages)

Funding: National Natural Science Foundation of China (61876122, 61602335).

Abstract: When mutual information among categorical data is computed in parallel on Spark, data skew can overload one or a few reducers and degrade cluster performance. To address this, the paper redefines the data skew model to quantify the skew among the partitions created by Spark and proposes the data virtual partitioning (DVP) algorithm. DVP adds random prefixes to a key so that it becomes several distinct keys, which reduces the cases in which a single task processes an excessive amount of data. DVP was implemented on a 24-node Spark cluster and compared with DEFH, Spark's traditional hash algorithm. The experiments show that DVP mitigates data skew during the Spark shuffle and reduces the time spent on load balancing.

Keywords: data skew; virtual partitioning; categorical data; parallel computation of mutual information; load balancing

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
