Design of RDD Persistence Method in Spark for SSDs  Cited by: 3

Authors: Lu Kezhong[1], Zhu Jinbin, Li Zhengmin[4], Sui Xiufeng[3,5]

Affiliations: [1] College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060; [2] School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 511400; [3] State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190; [4] National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029; [5] Strategic Studies Centre, Chinese Academy of Engineering, Beijing 100088

Source: Journal of Computer Research and Development, 2017, No. 6, pp. 1381-1390 (10 pages)

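Funding: National High Technology Research and Development Program of China ("863" Program) (2015AA015305); Natural Science Foundation of Guangdong Province (2014A030313553); Guangdong Province-Ministry Industry-University-Research Project (2013B090500055); Shenzhen Basic Research Discipline Layout Project (JCYJ20150529164656096)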

Abstract: Data centers built on hybrid storage that combines solid-state drives (SSDs) and hard disk drives (HDDs) have become a high-performance platform for big data computing. Workloads should be able to persist data with different characteristics to SSD or HDD on demand in order to improve overall system performance. Spark is an efficient big data computing framework widely used in industry and is especially suited to applications with many iterations, because it can persist intermediate data in memory or on disk; persisting to disk removes the limit that insufficient memory places on data set size. However, the current Spark implementation does not provide an explicit SSD-oriented persistence interface. Although data can be distributed proportionally across storage media according to configuration, users cannot specify the persistence medium of an RDD according to the characteristics of its data, so the mechanism lacks specificity and flexibility. This has become a bottleneck to further improving Spark's performance and also seriously limits the performance that hybrid storage systems can deliver. This paper presents, to the authors' knowledge for the first time, an SSD-oriented data persistence strategy. It explores the data persistence principle in Spark, optimizes Spark's persistence architecture for hybrid storage systems, and provides a dedicated persistence API through which users can explicitly and flexibly specify the persistence medium of an RDD. Experimental results based on SparkBench show that the optimized Spark improves performance by 14.02% on average compared with the native version.
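
The idea of letting the caller name the storage medium for each persisted data set can be illustrated with a small toy model. The sketch below is not Spark code and not the paper's API: `Medium`, `BlockStore`, and the directory arguments are hypothetical names, and the two directories merely stand in for mount points on an SSD and an HDD.

```python
import pickle
from enum import Enum
from pathlib import Path


class Medium(Enum):
    """Hypothetical persistence targets; not Spark's StorageLevel."""
    MEMORY = "memory"
    SSD = "ssd"
    HDD = "hdd"


class BlockStore:
    """Toy store that persists each block to the medium the caller names."""

    def __init__(self, ssd_dir, hdd_dir):
        # Assumption: each directory lives on the corresponding device.
        self._dirs = {Medium.SSD: Path(ssd_dir), Medium.HDD: Path(hdd_dir)}
        self._memory = {}
        self._location = {}

    def _path(self, block_id, medium):
        return self._dirs[medium] / f"{block_id}.pkl"

    def persist(self, block_id, data, medium):
        """Store data under block_id on the explicitly chosen medium."""
        if medium is Medium.MEMORY:
            self._memory[block_id] = data
        else:
            self._path(block_id, medium).write_bytes(pickle.dumps(data))
        self._location[block_id] = medium

    def get(self, block_id):
        """Read a block back from wherever it was persisted."""
        medium = self._location[block_id]
        if medium is Medium.MEMORY:
            return self._memory[block_id]
        return pickle.loads(self._path(block_id, medium).read_bytes())

    def medium_of(self, block_id):
        """Report which medium holds the block."""
        return self._location[block_id]


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as ssd, tempfile.TemporaryDirectory() as hdd:
        store = BlockStore(ssd, hdd)
        # Frequently re-read iteration data goes to the faster device.
        store.persist("hot_block", [1, 2, 3], Medium.SSD)
        # Rarely re-read data goes to the larger, slower device.
        store.persist("cold_block", {"k": "v"}, Medium.HDD)
        print(store.medium_of("hot_block"), store.get("hot_block"))
```

The point of the sketch is only the interface shape: the placement decision is an argument supplied per data set rather than a global ratio fixed in configuration.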

Keywords: big data; hybrid storage; solid-state drive (SSD); Spark; persistence

Classification: TP303 [Automation and Computer Technology: Computer Architecture]

 
