BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism

Authors: Runzhe CHEN, Guandong LU, Yakai WANG, Rui ZHANG, Zheng HU, Yanming MIAO, Zhifang CAI, Jingwen LENG, Minyi GUO

Affiliations: [1] School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; [2] Shanghai Qi Zhi Institution, Shanghai 200232, China; [3] Huawei Technologies Co., Ltd, Shenzhen 518129, China

Source: Frontiers of Computer Science, 2025, No. 1, pp. 29-39 (11 pages)

Funding: supported by the National Key R&D Program of China (2021ZD0110104) and the National Natural Science Foundation of China (Grant Nos. 62222210, U21B2017, 61832006, and 62072297).

Abstract: As deep neural networks (DNNs) have been successfully adopted in various domains, training these large-scale models has become increasingly difficult and is often deployed on compute clusters composed of many devices such as GPUs. However, as the cluster size grows, so does the probability of failures during training. Currently, faults are mainly handled by recording checkpoints and recovering from them, but this approach incurs large overhead and degrades training efficiency even when no error occurs: a low checkpointing frequency loses a large amount of training progress on failure, while a high frequency slows down training itself. To resolve this trade-off, we propose BAFT, a bubble-aware fault-tolerant framework for hybrid-parallel distributed training. BAFT automatically analyzes parallel strategies, profiles runtime information, and schedules checkpointing tasks at the granularity of pipeline stages according to the bubble distribution in the training schedule. It achieves high checkpoint efficiency while introducing less than 1% time overhead, which allows checkpoints to be recorded at high frequency, thereby reducing the training time lost during error recovery and avoiding the impact of fault tolerance on training.
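The core idea the abstract describes is to hide checkpointing inside pipeline bubbles, i.e., the idle intervals in which a stage waits on its neighbors. A minimal sketch of this scheduling step is shown below; all names, intervals, and costs are illustrative assumptions for exposition, not BAFT's actual implementation:

```python
# Hypothetical sketch of bubble-aware checkpoint scheduling: given the idle
# ("bubble") intervals observed per pipeline stage and the time each stage
# needs to snapshot its state, greedily place every stage's checkpoint task
# into a bubble large enough to hide it, so checkpointing overlaps with
# idle time instead of stalling computation.

def schedule_checkpoints(bubbles, ckpt_cost):
    """Return {stage_id: start_ms} for stages whose checkpoint fits a bubble.

    bubbles   -- {stage_id: [(start_ms, end_ms), ...]} idle intervals per stage
    ckpt_cost -- {stage_id: duration_ms} time to snapshot that stage's state
    Stages with no bubble large enough are absent from the plan and would
    have to checkpoint outside the bubbles (paying visible overhead).
    """
    plan = {}
    for stage, intervals in bubbles.items():
        need = ckpt_cost[stage]
        for start, end in sorted(intervals):
            if end - start >= need:  # bubble is long enough to hide the copy
                plan[stage] = start
                break
    return plan

# Toy 4-stage pipeline: in a typical schedule, later stages see larger
# warm-up/cool-down bubbles than the first stage does.
bubbles = {
    0: [(90, 100)],             # stage 0: almost no idle time
    1: [(0, 15), (85, 100)],
    2: [(0, 30), (70, 100)],
    3: [(0, 45)],
}
ckpt_cost = {0: 20, 1: 12, 2: 25, 3: 40}

plan = schedule_checkpoints(bubbles, ckpt_cost)
print(plan)  # stage 0's 20 ms snapshot fits no bubble, so it is absent
```

The greedy first-fit choice here is only one plausible policy; the paper's scheduler additionally profiles runtime information and derives the bubble distribution from the parallel strategy itself.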

Keywords: distributed training; fault tolerance; checkpoint; pipeline parallelism; error recovery

Classification: TP3 [Automation and Computer Technology - Computer Science and Technology]
