Authors: HE Yu-lin (何玉林), WU Dong-tong (吴东彤), Philippe Fournier-Viger, HUANG Zhe-xue (黄哲学)
Affiliations: [1] Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen, Guangdong 518107, China; [2] College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
Source: Acta Electronica Sinica (《电子学报》), 2024, No. 10, pp. 3322-3335 (14 pages)
Funding: Shenzhen Science and Technology Major Project (No. 202302D074); Natural Science Foundation of Guangdong Province, General Program (No. 2023A1515011667); Shenzhen Basic Research General Program (No. JCYJ20210324093609026); Guangdong Basic and Applied Basic Research Foundation, Guangdong-Shenzhen Joint Fund Key Project (No. 2023B1515120020).
Abstract: Spark is a distributed big-data processing framework based on in-memory computing, with fast execution and strong generality. During task computation, Spark's default partitioner, HashPartitioner, tends to produce unbalanced partition sizes when processing skewed data, resulting in low resource utilization and poor running efficiency. Existing improvements for balanced Spark partitioning, such as multi-stage partitioning, migration partitioning, and sampling partitioning, mostly suffer from drawbacks such as difficult scale control, high communication overhead, and excessive dependence on sampling. To address these problems, this paper proposes a partitioning method based on a first-filling strategy, which considers the allocation of both sampled and unsampled data in order to achieve balanced partitioning of all the data. The method samples the data and estimates a weight for each key from the sample information, sorts the keys in descending order of weight, and assigns each key in turn to the earliest partition that can accept it while satisfying the partition tolerance, reserving space in the later partitions for keys that were not sampled; this yields a partitioning plan for the sampled data. Spark then partitions the data of keys that appear in the sample according to this plan, while the data of keys absent from the sample are mapped directly to the last allocatable partition. Experimental results show that the new method effectively achieves balanced partitioning of Spark data: on real airline datasets released by the U.S. Bureau of Transportation Statistics, the first-filling partitioner (FFP) designed on this method shortens the total running time by 15.3% on average compared with HashPartitioner, and by 38.7% and 30.2% on average compared with the existing balanced Spark data partitioner and the hash-based key-reassigning partitioner, respectively.
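The first-filling strategy described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names (`build_partition_plan`, `partition_of`) and the `tolerance` parameter are hypothetical, not the authors' actual implementation or any Spark API; the real FFP operates inside Spark's partitioner interface.

```python
def build_partition_plan(sample_counts, num_partitions, tolerance=1.2):
    """Assign sampled keys to partitions by descending estimated weight.

    Each key goes to the earliest partition whose load would stay within
    `tolerance` times the ideal (average) partition size; space in the
    last partition is reserved for keys not seen in the sample.
    (Illustrative sketch, not the paper's exact algorithm.)
    """
    total = sum(sample_counts.values())
    capacity = tolerance * total / num_partitions  # per-partition budget
    loads = [0.0] * num_partitions
    plan = {}
    # Sort keys by estimated weight, heaviest first.
    for key, weight in sorted(sample_counts.items(), key=lambda kv: -kv[1]):
        placed = False
        # "First filling": try the earlier partitions 0 .. n-2 in order.
        for p in range(num_partitions - 1):
            if loads[p] + weight <= capacity:
                plan[key] = p
                loads[p] += weight
                placed = True
                break
        if not placed:
            # Keys that fit nowhere share the last partition.
            plan[key] = num_partitions - 1
            loads[num_partitions - 1] += weight
    return plan

def partition_of(key, plan, num_partitions):
    """Keys absent from the sample map to the last allocatable partition."""
    return plan.get(key, num_partitions - 1)
```

With sampled counts `{"a": 35, "b": 25, "c": 20, "d": 10, "e": 10}` and three partitions, the plan spreads the load roughly evenly (35/35/30), and any key unseen during sampling falls through to the last partition.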
Keywords: balanced partitioning; first-filling strategy; data skew; Spark operators; big data
Classification: TP311 [Automation and Computer Technology — Computer Software and Theory]