Multi-Resource Scheduling Elastic Restart Method Based on the L1 Norm

Authors: TIAN Lingfeng; KUANG Liwei; GAI Zhengchen; ZHOU Keke (Department of Communication and Information Systems, Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China; Network Production Line, Fiberhome Communication Technologies Co., Ltd., Wuhan 430074, China)

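Affiliations: [1] Department of Communication and Information Systems, Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074; [2] Network Production Line, Fiberhome Communication Technologies Co., Ltd., Wuhan 430074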

Source: Network New Media Technology, 2025, No. 1, pp. 41-49, 68 (10 pages)

Funding: Guangdong Province Key-Area Research and Development Program (No. 2021B0101400005).

Abstract: Elastic learning plays an important role in improving resource utilization and accelerating training in deep learning. This paper addresses two problems in deep learning scenarios: elastic methods take too long to restart, which hurts training efficiency, and resources must be reallocated after each restart. We propose a multi-resource scheduling elastic restart method based on the L1 norm. Taking multiple resource types into account, we construct a resource allocation matrix that integrates all tasks and resources, and solve with the L1 norm to select a near-optimal resource allocation after an elastic restart. In addition, within the elastic restart cycle, we save checkpoint state at the level of forward-propagation layers, which enables fast recovery after a training interruption and reduces the risk of losing progress during a restart. Experimental results show that this model-granularity elastic restart method reduces system overhead during restart by 6.3% to 18.4% compared with traditional methods, achieves 1.23 to 2.16 times the throughput of mainstream methods when training multiple tasks under limited resources, and mitigates the loss of training progress during elastic restarts.
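The abstract does not give the exact formulation, so the following is only a minimal sketch of how an L1-norm criterion could rank candidate allocations. It assumes (hypothetically) that each candidate is a tasks-by-resources matrix, that each task has a known demand matrix of the same shape, and that the preferred candidate is the feasible one whose deviation from demand has the smallest entrywise L1 norm. The names `l1_norm`, `is_feasible`, and `pick_allocation`, and all numbers, are illustrative and not taken from the paper.

```python
def l1_norm(matrix):
    """Entrywise L1 norm: sum of absolute values of all entries."""
    return sum(abs(x) for row in matrix for x in row)


def is_feasible(allocation, capacity):
    """True if the per-resource totals (column sums) fit within cluster capacity."""
    totals = [sum(col) for col in zip(*allocation)]
    return all(t <= c for t, c in zip(totals, capacity))


def pick_allocation(candidates, demand, capacity):
    """Return (allocation, cost) for the feasible candidate with the smallest
    L1 deviation from demand, or (None, None) if no candidate is feasible.

    Assumption: "best" means closest to demand in L1 norm; the paper's
    actual objective may differ.
    """
    best, best_cost = None, None
    for alloc in candidates:
        if not is_feasible(alloc, capacity):
            continue
        deviation = [[a - d for a, d in zip(a_row, d_row)]
                     for a_row, d_row in zip(alloc, demand)]
        cost = l1_norm(deviation)
        if best_cost is None or cost < best_cost:
            best, best_cost = alloc, cost
    return best, best_cost


# Rows are tasks, columns are resources (GPUs, memory in GB); values are made up.
demand = [[4, 16], [2, 8]]
capacity = [4, 24]
candidates = [
    [[4, 16], [2, 8]],  # needs 6 GPUs in total, so it exceeds capacity
    [[2, 16], [2, 8]],  # feasible, deviation norm 2
    [[3, 12], [1, 8]],  # feasible, deviation norm 6
]
best, best_cost = pick_allocation(candidates, demand, capacity)
print(best, best_cost)
```

Enumerating candidates keeps the sketch short; a real scheduler would more likely pose this as an optimization problem over the allocation matrix rather than score a fixed list.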

Keywords: elastic deployment; resource scheduling; norm; deep learning; distributed cluster

Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
