检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:Runzhe CHEN Guandong LU Yakai WANG Rui ZHANG Zheng HU Yanming MIAO Zhifang CAI Jingwen LENG Minyi GUO
机构地区:[1]School of Electronic Information and Electrical Engineering,Shanghai Jiao Tong University,Shanghai 200240,China [2]Shanghai Qi Zhi Institution,Shanghai 200232,China [3]Huawei Technologies Co.,Ltd,Shenzhen 518129,China
出 处:《Frontiers of Computer Science》2025年第1期29-39,共11页计算机科学前沿(英文版)
基 金:supported by the National Key R&D Program of China(2021ZD0110104);the National Natural Science Foundation of China(Grant Nos.62222210,U21B2017,61832006,and 62072297).
摘 要:As deep neural networks (DNNs) have been successfully adopted in various domains, the training of these large-scale models becomes increasingly difficult and is often deployed on compute clusters composed of many devices like GPUs. However, as the size of the cluster increases, so does the possibility of failures during training. Currently, faults are mainly handled by recording checkpoints and recovering, but this approach causes large overhead and affects the training efficiency even when no error occurs. The low checkpointing frequency leads to a large loss of training time, while the high recording frequency affects the training efficiency. To solve this contradiction, we propose BAFT, a bubble-aware fault tolerant framework for hybrid parallel distributed training. BAFT can automatically analyze parallel strategies, profile the runtime information, and schedule checkpointing tasks at the granularity of pipeline stage depending on the bubble distribution in the training. It supports higher checkpoint efficiency and only introduces less than 1% time overhead, which allows us to record checkpoints at high frequency, thereby reducing the time loss in error recovery and avoiding the impact of fault tolerance on training.
关 键 词:distributed training fault tolerance CHECKPOINT pipeline parallelism error recovery
分 类 号:TP3[自动化与计算机技术—计算机科学与技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:3.142.152.51