Authors: HE Yu-lin (何玉林), WU Dong-tong (吴东彤), Philippe Fournier-Viger, HUANG Zhe-xue (黄哲学)
Affiliations: [1] Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen, Guangdong 518107, China; [2] College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
Source: Acta Electronica Sinica (《电子学报》), 2024, No. 10, pp. 3322-3335 (14 pages)
Funding: Shenzhen Science and Technology Major Project (No. 202302D074); Natural Science Foundation of Guangdong Province, General Program (No. 2023A1515011667); Shenzhen Basic Research General Program (No. JCYJ20210324093609026); Guangdong Basic and Applied Basic Research Foundation, Guangdong-Shenzhen Joint Fund Key Project (No. 2023B1515120020).
Abstract: Spark is a distributed big-data processing framework based on in-memory computing, with fast execution and strong generality. During task computation, Spark's default partitioner, HashPartitioner, tends to produce unbalanced partition sizes when processing skewed data, resulting in low resource utilization and poor running efficiency. Existing improvements for balanced Spark partitioning, such as multi-stage partitioning, migration partitioning, and sampling partitioning, mostly suffer from drawbacks such as difficult scale control, high communication overhead, and excessive dependence on sampling. To address these problems, this paper proposes a partitioning method based on a first-filling strategy, which considers the allocation of both sampled and unsampled data in order to achieve balanced partitioning of all the data. The method samples the data and estimates a weight for each key from the sample information, sorts the keys in descending order of weight, and assigns each key in turn to the earliest partition that can accept it while satisfying the partition tolerance, reserving space in the later partitions for keys that were not sampled; this yields a partitioning plan for the sampled data. Spark then partitions the data of keys that appear in the sample according to this plan, while the data of keys absent from the sample are mapped directly to the last allocatable partition. Experimental results show that the new method effectively achieves balanced partitioning of Spark data: on real airline datasets released by the U.S. Bureau of Transportation Statistics, the first-filling partitioner (FFP) designed on this method shortens the total running time by 15.3% on average compared with HashPartitioner, and by 38.7% and 30.2% on average compared with the existing balanced Spark data partitioner and the hash-based key-reassigning partitioner, respectively.
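The first-filling strategy described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names (`build_partition_plan`, `partition_of`) and the `tolerance` parameter are hypothetical, not the authors' actual implementation or any Spark API; the real FFP operates inside Spark's partitioner interface.

```python
def build_partition_plan(sample_counts, num_partitions, tolerance=1.2):
    """Assign sampled keys to partitions by descending estimated weight.

    Each key goes to the earliest partition whose load would stay within
    `tolerance` times the ideal (average) partition size; space in the
    last partition is reserved for keys not seen in the sample.
    (Illustrative sketch, not the paper's exact algorithm.)
    """
    total = sum(sample_counts.values())
    capacity = tolerance * total / num_partitions  # per-partition budget
    loads = [0.0] * num_partitions
    plan = {}
    # Sort keys by estimated weight, heaviest first.
    for key, weight in sorted(sample_counts.items(), key=lambda kv: -kv[1]):
        placed = False
        # "First filling": try the earlier partitions 0 .. n-2 in order.
        for p in range(num_partitions - 1):
            if loads[p] + weight <= capacity:
                plan[key] = p
                loads[p] += weight
                placed = True
                break
        if not placed:
            # Keys that fit nowhere share the last partition.
            plan[key] = num_partitions - 1
            loads[num_partitions - 1] += weight
    return plan

def partition_of(key, plan, num_partitions):
    """Keys absent from the sample map to the last allocatable partition."""
    return plan.get(key, num_partitions - 1)
```

With sampled counts `{"a": 35, "b": 25, "c": 20, "d": 10, "e": 10}` and three partitions, the plan spreads the load roughly evenly (35/35/30), and any key unseen during sampling falls through to the last partition.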
Keywords: balanced partitioning; first-filling strategy; data skew; Spark operators; big data
Classification: TP311 [Automation and Computer Technology — Computer Software and Theory]