Authors: JU Tao; LIU Shuai; HUO Jiu-yuan [1]; ZHANG Xue-jun [1] (School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China)
Affiliation: [1] School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
Source: Journal of Jilin University: Engineering and Technology Edition, 2024, No. 12, pp. 3601-3613 (13 pages)
Funding: National Natural Science Foundation of China (61862037, 62262038); Lanzhou Talent Innovation and Entrepreneurship Project (2021-RC-40); Lanzhou Jiaotong University Tianyou Innovation Team Project (TY202002).
Abstract: To address the problems of large memory consumption, low device utilization, long training time, and difficulty of convergence in model-parallel training of large-scale deep neural network (DNN) models, an adaptive task-scheduling method for model-parallel DNN training was proposed. First, a multi-iteration asynchronous parallel management mechanism for model parallelism was established; by controlling the scheduling of micro-batch units, it achieves rational model partitioning and allocation of computing resources and resolves the delayed gradient updates that arise during asynchronous iteration. Second, a topology-aware computing-resource allocation mechanism was designed to best match model-training tasks to computing resources. Finally, a runtime scheduling strategy for computing resources and model tasks was designed to maximize the overlap between computation and communication during fine-grained deep-learning model training and to improve the utilization of computing resources. Experimental results show that, compared with existing model-parallel methods, the proposed scheduling strategy makes full use of each GPU's computing resources and improves the training speed of large-scale DNN models by 2.8 times on average while preserving model training accuracy.
Keywords: parallel computing; deep neural network model parallelism; pipeline parallelism; asynchronous parallelism; task scheduling; computation-communication overlap
Classification: TP311 [Automation and Computer Technology - Computer Software and Theory]
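The micro-batch scheduling idea summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a simple GPipe-style fill-and-drain forward schedule, and the function names (`pipeline_schedule`, `bubble_ratio`) are illustrative. It shows why splitting a mini-batch into micro-batch units raises device utilization: stage k can process micro-batch i while stage k-1 already works on micro-batch i+1, so the idle "pipeline bubble" shrinks as the number of micro-batches grows.

```python
# Hypothetical sketch of pipeline-parallel micro-batch scheduling
# (forward pass only, fill-and-drain). Stages model devices holding
# consecutive partitions of the DNN.

def pipeline_schedule(num_stages: int, num_microbatches: int):
    """Return a list of time steps; each step maps stage -> micro-batch id."""
    steps = []
    depth = num_stages + num_microbatches - 1  # total pipeline steps
    for t in range(depth):
        step = {}
        for stage in range(num_stages):
            mb = t - stage  # micro-batch reaching this stage at time t
            if 0 <= mb < num_microbatches:
                step[stage] = mb
        steps.append(step)
    return steps

def bubble_ratio(num_stages: int, num_microbatches: int) -> float:
    """Fraction of idle stage-steps; shrinks as micro-batches increase."""
    total_slots = num_stages * (num_stages + num_microbatches - 1)
    busy_slots = num_stages * num_microbatches
    return 1.0 - busy_slots / total_slots

schedule = pipeline_schedule(num_stages=4, num_microbatches=8)
print(len(schedule))       # 11 time steps for 4 stages, 8 micro-batches
print(schedule[0])         # {0: 0}: only stage 0 busy while filling
print(bubble_ratio(4, 8))  # 3/11 of stage-steps are idle
```

With 4 stages and only 1 micro-batch the bubble ratio is 0.75 (pure model parallelism, three of four devices idle at any time); with 8 micro-batches it drops to about 0.27, which is the utilization gain that micro-batch scheduling targets.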