Authors: Fu Maozhong (傅懋钟), Hu Haiyang (胡海洋)[1], Li Zhongjin (李忠金)[1,2]
Affiliations: [1] School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018; [2] Intelligent Software Technology and Application Research Center, Advanced Institute of Information Technology, Peking University, Hangzhou, Zhejiang 311215
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2023, Issue 6, pp. 1308-1321 (14 pages)
Funding: Zhejiang Provincial Natural Science Foundation (LY22F020021); Key R&D Program of Zhejiang Province "Lingyan" Project (2023C01145); National Natural Science Foundation of China (61802095, 61572162).
Abstract: Deep neural networks (DNNs) have been widely applied in many areas of human society. Increasing the size of a DNN model can significantly improve its accuracy; however, training a large-scale DNN model on a single GPU takes considerable time. Hence, how to train multiple DNN models in parallel on a GPU cluster with distributed deep learning (DDL) techniques has attracted wide attention from both industry and academia. Based on this, we propose a dynamic resource scheduling (DRS) method for GPU clusters to solve the multi-DNN task scheduling problem under deadline constraints in a heterogeneous-bandwidth environment. Specifically, a resource-time model is first constructed on top of the Ring-AllReduce communication architecture to estimate the running time of a DDL task under different resource schemes. A resource-performance model is then built from the deadline requirement to achieve efficient resource utilization. Finally, the DRS algorithm combines the two models to make resource-scheme decisions for multi-DNN training. DRS incorporates the nearest-deadline principle when performing actual resource allocation, and uses a resource migration mechanism to reduce the impact of resource fragmentation that arises during scheduling. Experiments under heterogeneous bandwidth on a GPU cluster with four NVIDIA GeForce RTX 2080 Ti GPUs show that DRS improves the deadline guarantee rate by 39.53% over the baseline algorithms, and that the resource utilization of the cluster nodes reaches 91.27% during scheduling.
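The record gives no formulas for the resource-time model, so the following is only a minimal sketch of the idea the abstract describes, assuming the standard Ring-AllReduce transfer volume (each of p workers moves 2(p-1)/p of the gradient over the slowest link) and idealized linear compute scaling. All names (`Task`, `iter_time`, the example sizes) are hypothetical illustrations, not the paper's actual model.

```python
# Hypothetical sketch: estimate one training iteration of a DDL task under
# Ring-AllReduce, then pick the pending task with the nearest deadline.
# The linear compute-scaling assumption is an idealization for illustration.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    grad_bytes: float         # gradient volume exchanged per iteration (bytes)
    compute_time_1gpu: float  # measured per-iteration compute time on 1 GPU (s)
    iters_left: int           # remaining training iterations
    deadline: float           # absolute deadline (s)

def ring_allreduce_time(grad_bytes: float, workers: int, bandwidth: float) -> float:
    """Standard Ring-AllReduce cost: each worker transfers 2(p-1)/p of the
    gradient, bounded by the slowest link bandwidth (bytes/s)."""
    if workers <= 1:
        return 0.0
    return 2.0 * (workers - 1) / workers * grad_bytes / bandwidth

def iter_time(task: Task, workers: int, bandwidth: float) -> float:
    # Assumes compute time divides evenly across workers (an idealization).
    return task.compute_time_1gpu / workers + ring_allreduce_time(
        task.grad_bytes, workers, bandwidth)

def pick_nearest_deadline(pending: list[Task]) -> Task:
    # Earliest-deadline-first selection, echoing the abstract's
    # "nearest deadline" allocation principle.
    return min(pending, key=lambda t: t.deadline)

if __name__ == "__main__":
    a = Task("resnet50", grad_bytes=1.0e8, compute_time_1gpu=0.9,
             iters_left=10_000, deadline=3_600.0)
    b = Task("vgg16", grad_bytes=5.3e8, compute_time_1gpu=1.4,
             iters_left=8_000, deadline=1_800.0)
    # 4 GPUs whose slowest interconnect sustains 10 GB/s:
    per_iter = iter_time(a, workers=4, bandwidth=10e9)
    print(f"{a.name}: est. remaining time {per_iter * a.iters_left:.1f} s")
    print("next task by deadline:", pick_nearest_deadline([a, b]).name)
```

Under such a model, heterogeneous bandwidth matters because the slowest link in the ring dominates the communication term, which is why a scheduler can trade worker count against placement when deciding a resource scheme.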
Keywords: resource scheduling; GPU cluster; distributed deep learning; heterogeneous bandwidth; resource migration
Classification: TP301.6 [Automation and Computer Technology - Computer System Architecture]