Performance analysis of distributed deep learning communication architecture    Cited by: 3


Authors: ZHANG Li-zhi; RAN Zhe-jiang; LAI Zhi-quan; LIU Feng[1] (Key Laboratory of Parallel and Distributed Processing of National Defense Technology, College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)

Affiliation: [1] Key Laboratory of Parallel and Distributed Processing of National Defense Technology, College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China

Source: Computer Engineering & Science (《计算机工程与科学》), 2021, No. 3, pp. 416-425 (10 pages)

Funding: National Key R&D Program of China (2018YFB0204301); National Natural Science Foundation of China (61702533).

Abstract: In recent years, advances in deep learning have pushed artificial intelligence into a new stage of development. However, massive training data and very large models pose increasingly severe challenges to deep learning, and distributed deep learning has emerged as an effective way to meet them; an efficient parameter communication architecture is key to the performance of distributed deep learning. Addressing the problem of parallel training with traditional model synchronization architectures on large numbers of nodes, this paper first analyzes the principles and performance of the two mainstream parameter communication architectures: the centralized Parameter Server and the decentralized Ring Allreduce. It then builds a comparative test environment for the two distributed training architectures on the Tianhe high-performance GPU cluster, based on TensorFlow. Finally, taking the Parameter Server architecture as the baseline, it measures the performance of the Ring Allreduce architecture when training AlexNet and ResNet-50 on the GPU cluster. The experimental results show that, with 32 GPUs, the scaling efficiency of the Ring Allreduce architecture reaches 97%, and its distributed computing performance is 30% higher than that of the Parameter Server architecture, verifying that the Ring Allreduce architecture has better scalability.
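The abstract contrasts a centralized Parameter Server with decentralized Ring Allreduce. The standard bandwidth argument behind this comparison (well known, though not spelled out above) is that in a ring of N workers each worker sends and receives roughly 2(N-1)/N times the model size per synchronization step, which is nearly constant in N, whereas the aggregate traffic through a central parameter server grows linearly with the number of workers. The sketch below illustrates how a Ring Allreduce data-parallel job of the kind described is commonly set up with TensorFlow. The paper does not publish its training scripts, so the use of Horovod, the tf.keras ResNet-50 model, the learning-rate scaling, and the horovodrun launch command are assumptions for illustration, not the authors' actual configuration.

```python
# Minimal sketch of Ring Allreduce data-parallel training (assumed: Horovod + TensorFlow 2,
# one process per GPU, launched e.g. with `horovodrun -np 32 python train.py`).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # each process joins the ring and learns its rank

# Pin every process to its own GPU on the node.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# ResNet-50 as in the paper's experiments; the input pipeline is omitted.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# Scale the learning rate with the number of workers (a common heuristic, not
# necessarily what the authors used), then wrap the optimizer so that gradients
# are averaged via ring allreduce before each update.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    # Broadcast rank 0's initial weights so all replicas start identical.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# model.fit(train_dataset, epochs=..., callbacks=callbacks)
```

For the Parameter Server baseline, an equivalent setup would instead shard the model variables onto dedicated server processes, for example with TensorFlow's tf.distribute.experimental.ParameterServerStrategy; again, the paper does not state which implementation it used, so this is only an assumption about how the comparison could be reproduced.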

Keywords: Ring Allreduce; parameter server; distributed training; deep learning; deep neural network

CLC number: TP301 [Automation and Computer Technology - Computer System Architecture]

 
