Authors: 田凌枫 (TIAN Lingfeng); 匡立伟 (KUANG Liwei); 盖政辰 (GAI Zhengchen); 周可可 (ZHOU Keke)
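Affiliations: [1] Department of Communication and Information Systems, Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China; [2] Network Production Line, Fiberhome Communication Technologies Co., Ltd., Wuhan 430074, China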
Source: Network New Media Technology (《网络新媒体技术》), 2025, No. 1, pp. 41-49, 68 (10 pages in total)
Funding: Key-Area Research and Development Program of Guangdong Province (Grant No. 2021B0101400005).
Abstract: Elastic learning plays an important role in improving resource utilization and accelerating training in deep learning. This paper addresses two problems of elastic methods in deep learning scenarios: restart times that are long enough to hurt training efficiency, and the need to reallocate resources after a restart. We propose a multi-resource scheduling elastic restart method based on the L1 norm. Taking the status of multiple resource types into account, the method constructs a resource allocation matrix that integrates all tasks and resources, and solves for the L1 norm to select a near-optimal resource allocation after an elastic restart. In addition, within each elastic restart cycle, checkpoint state is saved at the level of the forward-propagation layers, which enables fast recovery after a training interruption and reduces the risk of losing progress during a restart. Experimental results show that, compared with traditional methods, the proposed model-granularity elastic restart method reduces system overhead during restart by 6.3% to 18.4%, achieves 1.23 to 2.16 times the throughput of mainstream methods when training multiple tasks under limited resources, and mitigates the loss of training progress during elastic restarts.
CLC number: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
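Illustrative sketch (not the authors' implementation): the abstract describes building a resource allocation matrix over all tasks and resources and using the L1 norm to choose an allocation after a restart, but this record does not give the formulation. The Python below is one possible reading, under stated assumptions: rows are tasks, columns are resource types, the norm is the entrywise sum of absolute values, and the selection criterion is the smallest L1 norm of the shortfall between demand and allocation among candidates that fit the capacity. The names `l1_norm`, `is_feasible`, `choose_allocation`, and the sample numbers are all hypothetical.

```python
def l1_norm(matrix):
    """Entrywise L1 norm: sum of absolute values of all entries (an assumption)."""
    return sum(abs(x) for row in matrix for x in row)


def is_feasible(allocation, capacity):
    """An allocation fits if, per resource type, the total over tasks is within capacity."""
    return all(
        sum(row[j] for row in allocation) <= capacity[j]
        for j in range(len(capacity))
    )


def choose_allocation(demand, capacity, candidates):
    """Return the feasible candidate whose shortfall (demand - allocation)
    has the smallest L1 norm, together with that norm."""
    best, best_cost = None, None
    for alloc in candidates:
        if not is_feasible(alloc, capacity):
            continue  # skip candidates that exceed any resource capacity
        shortfall = [
            [d - a for d, a in zip(d_row, a_row)]
            for d_row, a_row in zip(demand, alloc)
        ]
        cost = l1_norm(shortfall)
        if best_cost is None or cost < best_cost:
            best, best_cost = alloc, cost
    return best, best_cost


# Hypothetical data: two tasks, two resource types (GPUs, memory in GB).
demand = [[4, 16], [2, 8]]
capacity = [4, 24]
candidates = [
    [[2, 12], [2, 8]],   # feasible, shortfall L1 norm 6
    [[3, 16], [1, 8]],   # feasible, shortfall L1 norm 2
    [[4, 16], [2, 8]],   # infeasible: needs 6 GPUs, only 4 available
]

best, cost = choose_allocation(demand, capacity, candidates)
print(best, cost)
```

Enumerating candidates keeps the sketch short; a real scheduler would more likely search or optimize over the allocation matrix directly, and the paper's actual objective may differ from the shortfall criterion assumed here.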