Memory optimization of the Spark parallel computing framework (Spark并行计算框架的内存优化)  Cited by: 10


Authors: 廖旺坚 (LIAO Wang-jian); 黄永峰 (HUANG Yong-feng) [1,2]; 包从开 (BAO Cong-kai) (Department of Electronic Engineering, Tsinghua University, Beijing 100084; National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing 100084, China)

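Affiliations: [1] Department of Electronic Engineering, Tsinghua University, Beijing 100084; [2] National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing 100084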

Source: Computer Engineering & Science (《计算机工程与科学》), 2018, No. 4, pp. 587-593 (7 pages)

Funding: National Key Technology R&D Program of China (2014BAH41B00); National Natural Science Foundation of China (U1405254; U1536207)

Abstract: Cluster parallel computing frameworks such as Spark are widely used in big data and cloud computing, and optimizing their runtime performance is key to applying them. To improve performance, the paper analyzes Spark's execution flow and memory management mechanism and, combining the characteristics of memory management at both the Spark and JVM levels, proposes three optimization strategies: (1) serialize and compress cached data to reduce its size, which lowers GC overhead and improves performance; (2) within a certain range, reduce the running memory size and let recomputation replace caching, which can improve performance; (3) configure appropriate memory allocation parameters, such as the ratio of the JVM new generation to the old generation and the ratio of Spark computation space to cache space, which can improve performance considerably. Experimental results show that serialization and compression reduce the space occupied by the cache by 42%; performance improves by 21% when the submitted running memory is reduced from 1,000 MB to 800 MB; and optimizing the memory ratios improves performance by 10% to 30% over the default parameters.
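The three strategies in the abstract correspond to settings that Spark and the JVM expose as configuration. The sketch below is a hedged illustration only: it assumes Spark's unified memory manager (the `spark.memory.*` properties of Spark 1.6 and later), treats "running memory" as executor memory, and uses placeholder values rather than the settings the paper actually tested. The helper names `build_spark_conf` and `to_submit_args` are hypothetical.

```python
from typing import Dict, List


def build_spark_conf(executor_memory_mb: int = 800,
                     memory_fraction: float = 0.6,
                     storage_fraction: float = 0.5,
                     new_ratio: int = 2) -> Dict[str, str]:
    """Collect Spark properties related to the three strategies.

    Every default here is an illustrative assumption, not a value from the paper.
    """
    if executor_memory_mb <= 0:
        raise ValueError("executor_memory_mb must be positive")
    if not 0.0 < memory_fraction < 1.0:
        raise ValueError("memory_fraction must be in (0, 1)")
    if not 0.0 <= storage_fraction <= 1.0:
        raise ValueError("storage_fraction must be in [0, 1]")
    if new_ratio < 1:
        raise ValueError("new_ratio must be >= 1")
    return {
        # Strategy 1: serialize and compress cached data (Kryo serializer,
        # compression of serialized RDD partitions).
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.rdd.compress": "true",
        # Strategy 2: a smaller heap, so more partitions are recomputed
        # instead of cached (assumed here to mean executor memory).
        "spark.executor.memory": f"{executor_memory_mb}m",
        # Strategy 3a: share of the heap for Spark execution plus storage,
        # and the part of that share protected for cached data.
        "spark.memory.fraction": str(memory_fraction),
        "spark.memory.storageFraction": str(storage_fraction),
        # Strategy 3b: JVM old:new generation size ratio (HotSpot flag).
        "spark.executor.extraJavaOptions": f"-XX:NewRatio={new_ratio}",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the properties as `--conf key=value` pairs for spark-submit."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(build_spark_conf())))
```

Which values help depends on the workload and cluster, so any such configuration would need to be measured against the defaults, as the abstract's experiments do.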

Keywords: Spark; performance optimization; heap memory

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
