Authors: Fu Maozhong (傅懋钟), Hu Haiyang (胡海洋)[1], Li Zhongjin (李忠金)[1,2]
Affiliations: [1] School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018; [2] Intelligent Software Technology and Application Research Center, Advanced Institute of Information Technology, Peking University, Hangzhou, Zhejiang 311215
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2023, Issue 6, pp. 1308-1321 (14 pages)
Funding: Zhejiang Provincial Natural Science Foundation (LY22F020021); Key R&D Program of Zhejiang Province "Lingyan" Project (2023C01145); National Natural Science Foundation of China (61802095, 61572162).
Abstract: Deep neural networks (DNNs) have been widely applied in many areas of human society. Increasing the size of a DNN model can significantly improve its accuracy; however, training a large-scale DNN model on a single GPU takes considerable time. Hence, how to train multiple DNN models in parallel on a GPU cluster with distributed deep learning (DDL) techniques has attracted wide attention from both industry and academia. Based on this, we propose a dynamic resource scheduling (DRS) method for GPU clusters to solve the multi-DNN task scheduling problem under deadline constraints in a heterogeneous-bandwidth environment. Specifically, a resource-time model is first constructed on top of the Ring-AllReduce communication architecture to estimate the running time of a DDL task under different resource schemes. A resource-performance model is then built from the deadline requirement to achieve efficient resource utilization. Finally, the DRS algorithm combines the two models to make resource-scheme decisions for multi-DNN training. DRS incorporates the nearest-deadline principle when performing actual resource allocation, and uses a resource migration mechanism to reduce the impact of resource fragmentation that arises during scheduling. Experiments under heterogeneous bandwidth on a GPU cluster with four NVIDIA GeForce RTX 2080 Ti GPUs show that DRS improves the deadline guarantee rate by 39.53% over the baseline algorithms, and that the resource utilization of the cluster nodes reaches 91.27% during scheduling.
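The record gives no formulas for the resource-time model, so the following is only a minimal sketch of the idea the abstract describes, assuming the standard Ring-AllReduce transfer volume (each of p workers moves 2(p-1)/p of the gradient over the slowest link) and idealized linear compute scaling. All names (`Task`, `iter_time`, the example sizes) are hypothetical illustrations, not the paper's actual model.

```python
# Hypothetical sketch: estimate one training iteration of a DDL task under
# Ring-AllReduce, then pick the pending task with the nearest deadline.
# The linear compute-scaling assumption is an idealization for illustration.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    grad_bytes: float         # gradient volume exchanged per iteration (bytes)
    compute_time_1gpu: float  # measured per-iteration compute time on 1 GPU (s)
    iters_left: int           # remaining training iterations
    deadline: float           # absolute deadline (s)

def ring_allreduce_time(grad_bytes: float, workers: int, bandwidth: float) -> float:
    """Standard Ring-AllReduce cost: each worker transfers 2(p-1)/p of the
    gradient, bounded by the slowest link bandwidth (bytes/s)."""
    if workers <= 1:
        return 0.0
    return 2.0 * (workers - 1) / workers * grad_bytes / bandwidth

def iter_time(task: Task, workers: int, bandwidth: float) -> float:
    # Assumes compute time divides evenly across workers (an idealization).
    return task.compute_time_1gpu / workers + ring_allreduce_time(
        task.grad_bytes, workers, bandwidth)

def pick_nearest_deadline(pending: list[Task]) -> Task:
    # Earliest-deadline-first selection, echoing the abstract's
    # "nearest deadline" allocation principle.
    return min(pending, key=lambda t: t.deadline)

if __name__ == "__main__":
    a = Task("resnet50", grad_bytes=1.0e8, compute_time_1gpu=0.9,
             iters_left=10_000, deadline=3_600.0)
    b = Task("vgg16", grad_bytes=5.3e8, compute_time_1gpu=1.4,
             iters_left=8_000, deadline=1_800.0)
    # 4 GPUs whose slowest interconnect sustains 10 GB/s:
    per_iter = iter_time(a, workers=4, bandwidth=10e9)
    print(f"{a.name}: est. remaining time {per_iter * a.iters_left:.1f} s")
    print("next task by deadline:", pick_nearest_deadline([a, b]).name)
```

Under such a model, heterogeneous bandwidth matters because the slowest link in the ring dominates the communication term, which is why a scheduler can trade worker count against placement when deciding a resource scheme.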
Keywords: resource scheduling; GPU cluster; distributed deep learning; heterogeneous bandwidth; resource migration
Classification: TP301.6 [Automation and Computer Technology - Computer System Architecture]