Padding Load: A Workload That Reduces Cluster Resource Waste and Deep Learning Training Costs


Authors: DU Yu; YU Zishu; PENG Xiaohui; XU Zhiwei (Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China)

Affiliations: [1] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; [2] University of Chinese Academy of Sciences, Beijing 100049, China

Source: Computer Science, 2024, No. 9, pp. 71-79

Funding: Beijing Natural Science Foundation (4212027); National Natural Science Foundation of China (62072434).

Abstract: In recent years, large-scale models have achieved remarkable success in domains such as bioinformatics, natural language processing, and computer vision. However, these models require substantial computational resources during training and inference, resulting in high computational costs. Meanwhile, computing clusters suffer from a supply-demand imbalance, manifesting as low resource utilization and difficult task scheduling. To address this problem, this paper introduces the concept of Padding Load: a workload that exploits idle resources in a computing cluster. Resources allocated to a Padding Load can be preempted by other workloads at any time; because they run at a lower priority, they also cost less. PaddingTorch is a distributed deep learning training framework designed for Padding Load. Using data from the Alibaba PAI cluster, job scheduling is simulated on four GPUs over the four GPU time intervals with the most frequent task switching, and PaddingTorch is used to train a protein complex prediction program as a Padding Load. Training takes 2.8 times as long as with exclusive resources, but training cost is reduced by 84%, and GPU utilization rises by 25.8% during the intervals filled by the Padding Load.
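The abstract implies that a Padding Load must tolerate losing its resources at any moment and resume later without losing progress, which is typically achieved by checkpointing training state. The sketch below illustrates this checkpoint-and-resume pattern in plain Python; the class and method names (`PaddingJob`, `train_step`) are hypothetical and do not reflect PaddingTorch's actual API, which is not described in this abstract.

```python
# Hypothetical sketch of the padding-load idea: a low-priority training
# job that checkpoints after every step so it can be preempted at any
# time and later resumed on idle resources from where it left off.
# All names here are illustrative, not PaddingTorch's real interface.

class PaddingJob:
    def __init__(self, total_steps):
        self.total_steps = total_steps
        # Persisted state; a real framework would write this to storage.
        self.checkpoint = {"step": 0, "loss": None}

    def train_step(self, step):
        # Stand-in for one optimizer step; returns a dummy loss value.
        return 1.0 / (step + 1)

    def run(self, preempt_at=None):
        """Train from the last checkpoint; return False if preempted."""
        step = self.checkpoint["step"]
        while step < self.total_steps:
            if preempt_at is not None and step >= preempt_at:
                # Resources reclaimed by a higher-priority workload.
                return False
            loss = self.train_step(step)
            step += 1
            self.checkpoint = {"step": step, "loss": loss}  # save progress
        return True

job = PaddingJob(total_steps=10)
first = job.run(preempt_at=4)   # preempted after 4 steps -> False
second = job.run()              # resumes from the checkpoint -> True
print(first, second, job.checkpoint["step"])  # False True 10
```

The key design point, consistent with the abstract's 2.8x slowdown figure, is that preemption costs only the time lost to interruptions and restarts, never the work already checkpointed.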

Keywords: deep learning; distributed training; resource utilization; computing cluster; programming framework

CLC Number: TP312 [Automation and Computer Technology / Computer Software and Theory]

 
