Adaptive scheduling of computing tasks for deep neural network model parallelism


Authors: JU Tao, LIU Shuai, HUO Jiu-yuan, ZHANG Xue-jun

Affiliation: School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China

Source: Journal of Jilin University (Engineering and Technology Edition), 2024, No. 12, pp. 3601-3613 (13 pages)

Funding: National Natural Science Foundation of China (61862037, 62262038); Lanzhou Talent Innovation and Entrepreneurship Project (2021-RC-40); Tianyou Innovation Team Project of Lanzhou Jiaotong University (TY202002).

Abstract: To address the problems of high memory consumption, low device utilization, long training time, and difficult convergence encountered when training large-scale deep neural network (DNN) models with model parallelism, an adaptive computing-task scheduling method for DNN model parallelism was proposed. First, a multi-iteration asynchronous parallel management mechanism for model parallelism was established to control the scheduling of micro-batch units, realize reasonable model partitioning and computing-resource allocation, and resolve the delayed-gradient-update problem that arises during asynchronous iteration. Second, a topology-aware computing-resource allocation mechanism was designed to match model training tasks to computing resources. Finally, a runtime scheduling strategy for computing resources and model tasks was designed to maximize the overlap of computation and communication during model training and to improve the utilization of computing resources. Experimental results show that, compared with existing model-parallel methods, the proposed method makes full use of the computing resources of each GPU and accelerates large-scale DNN model training by 2.8 times on average while preserving model training accuracy.
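
The abstract describes controlling the scheduling order of micro-batch units in a pipelined, asynchronous model-parallel setting so that computation and communication overlap. The paper's implementation is not reproduced here; the following is only a minimal Python sketch of a generic one-forward-one-backward (1F1B) micro-batch schedule, the family of pipeline schedules such an adaptive scheduler controls. All identifiers (stage_schedule, num_stages, num_microbatches) are illustrative assumptions, not the authors' code.

    # Minimal sketch of a generic 1F1B pipeline schedule (an illustrative
    # assumption, NOT the paper's implementation). Each pipeline stage first
    # runs a few warm-up forwards, then alternates one forward and one
    # backward per micro-batch, so that transferring activations/gradients
    # for one micro-batch can overlap with computing another.

    def stage_schedule(num_stages: int, num_microbatches: int, stage: int):
        """Return the (op, microbatch) sequence executed by one pipeline stage."""
        warmup = min(num_stages - stage - 1, num_microbatches)
        ops, fwd, bwd = [], 0, 0
        for _ in range(warmup):                     # warm-up: forwards only
            ops.append(("F", fwd)); fwd += 1
        for _ in range(num_microbatches - warmup):  # steady state: 1F1B
            ops.append(("F", fwd)); fwd += 1
            ops.append(("B", bwd)); bwd += 1
        for _ in range(warmup):                     # cool-down: drain backwards
            ops.append(("B", bwd)); bwd += 1
        return ops

    if __name__ == "__main__":
        stages, micro = 4, 8  # e.g. 4 pipeline stages, 8 micro-batches per iteration
        for s in range(stages):
            seq = " ".join(f"{op}{mb}" for op, mb in stage_schedule(stages, micro, s))
            print(f"stage {s}: {seq}")

In a fully synchronous schedule, every backward pass uses the weights of the current iteration; the multi-iteration asynchronous mechanism described in the abstract additionally lets forwards of the next iteration start before all backwards of the current one finish, which is precisely what introduces the delayed-gradient-update problem the proposed management mechanism controls.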

Keywords: parallel computing; deep neural network model parallelism; pipeline parallelism; asynchronous parallelism; task scheduling; computation-communication overlap

CLC number: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
