A Performance Study of ARM-Based Hardware Compression Algorithms in Spark

Authors: ZHU Chang-Peng, TANG Jing-Ren, LIANG Yun, ZHANG Xiao-Chuan[1], HAN Bo[3], ZHAO Yin-Liang[4] (Department of Data Science and Big Data, Chongqing University of Technology, Chongqing 401135; Huawei Technologies Co., Ltd., Shenzhen, Guangdong 518100; School of Cyber Science and Engineering, Xi'an Jiaotong University, Xi'an 710049; School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049)

Affiliations: [1] Department of Data Science and Big Data, Chongqing University of Technology, Chongqing 401135; [2] Huawei Technologies Co., Ltd., Shenzhen, Guangdong 518100; [3] School of Cyber Science and Engineering, Xi'an Jiaotong University, Xi'an 710049; [4] School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049

Published in: Chinese Journal of Computers (《计算机学报》), 2023, Issue 12, pp. 2626-2650 (25 pages)

Funding: Supported by the Spark-enabled KAE Compression Project of the Kunpeng Zhongzhi Program (OAA21091100464724D), the China Scholarship Council (201708505099), and the National Natural Science Foundation of China (61702063).

Abstract: Released in 2021, the Kunpeng 920 is the world's first ARM-based 64-bit CPU built on a 7 nm process. It integrates a hardware acceleration engine named KAEzip, whose core is a hardware compression algorithm that accelerates compression and decompression in hardware. Prior research indicates that hardware-based compression algorithms hold a clear performance advantage over traditional software implementations. However, foundational big-data software such as Hadoop and Spark cannot recognize or use such algorithms. It is therefore important to evaluate how these algorithms perform in big-data environments and to uncover the key factors and potential defects that constrain their performance. To this end, this paper first proposes a "producer-consumer"-based performance model for Spark tasks that formally describes the relationships among multi-dimensional hardware resources, compression algorithms, and Spark task performance, and uses it to analytically identify the key factors affecting compression performance in Spark. The paper then proposes a three-layer architecture that enables Spark to recognize and use hardware compression algorithms; this layered design provides flexibility for further performance tuning and can be reused in other big-data system software. On this basis, KAEzip is evaluated with classic Spark benchmarks, and the performance model is applied to locate the factors and root causes that limit its performance. The experiments show that: (1) hardware compression algorithms can effectively improve Spark performance; for example, KAEzip outperforms snappy by up to 13.8% in compression, up to 7% in decompression, and up to 5.7% in a realistic application scenario (LDA); (2) the mismatch between the data-transfer rate of disks and the throughput of hardware compression is a major limiting factor; (3) the way compression algorithms are invoked in Spark makes a mismatch between the CPU's data-processing capability and hardware compression throughput more likely, which further constrains performance. The experiments also reveal that KAEzip can cause data inflation when compressing small inputs; the paper extends the three-layer architecture to expose the root cause of this problem and proposes an optimization based on the working mechanism of compression algorithms in Spark. As hardware compression is an emerging research direction, this work not only supports the optimization and evolution of KAEzip but also applies broadly to hardware compression algorithms built into CPUs.
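The data-inflation effect on small inputs reported for KAEzip is characteristic of DEFLATE-family codecs in general, whose fixed header and checksum overhead can exceed any savings on tiny payloads. A minimal illustrative sketch using Python's standard zlib module (not KAEzip itself, which is not assumed to be available here):

```python
import zlib

# For very small payloads, the zlib container overhead (2-byte header plus
# 4-byte Adler-32 checksum) plus the DEFLATE block can exceed the input
# size, so the "compressed" output is larger than the input: data inflation.
for size in (1, 8, 64, 512, 4096):
    data = b"x" * size
    out = zlib.compress(data)
    status = "inflated" if len(out) > size else "ok"
    print(f"input={size:5d}B  output={len(out):5d}B  {status}")
```

In Spark, the codec used for internal data such as shuffle output and spilled partitions is selected through the `spark.io.compression.codec` configuration setting, which is the natural hook point for a plug-in architecture like the one the paper describes; guarding that codec with a minimum-size threshold is one common way to avoid inflating small blocks.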

Keywords: Kunpeng 920 CPU; KAEzip; big data; Spark; hardware compression algorithm; root cause analysis

CLC Classification: TP311 (Automation and Computer Technology / Computer Software and Theory)

 
