Authors: Xuanzhe LIU, Yihao ZHAO, Shufan LIU, Xiang LI, Yibo ZHU, Xin LIU, Xin JIN
Affiliations: [1] School of Computer Science, Peking University, Beijing 100871, China; [2] Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871, China; [3] ByteDance, Beijing 100006, China; [4] StepFun, Shanghai 200232, China
Source: Science China (Information Sciences), 2024, Issue 12, pp. 119-135 (17 pages). Chinese journal title: 中国科学(信息科学)(英文版)
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62325201, 62172008); the National Natural Science Fund for the Excellent Young Scientists Fund Program (Overseas); and the PKU-ByteDance Joint-Lab Program.
Abstract: Large-scale GPU clusters are widely used to speed up both latency-critical (online) and best-effort (offline) deep learning (DL) workloads. However, in line with common practice, the DL clusters at ByteDance dedicate each GPU to one workload or share workloads in the time dimension, leading to very low GPU resource utilization. Existing techniques like NVIDIA MPS provide an opportunity to share multiple workloads in space on widely deployed NVIDIA GPUs, but they cannot guarantee the performance of online workloads. We present MuxFlow, the first production system that can scale to a massive number of GPUs to support highly efficient space-sharing for DL workloads. MuxFlow introduces a two-level protection mechanism for both memory and computation to guarantee the performance of online workloads. MuxFlow leverages dynamic streaming multiprocessor (SM) allocation to improve the efficiency of offline workloads. Based on our practical error analysis, we design a mixed error-handling mechanism to improve system reliability. MuxFlow has been deployed at ByteDance on more than 18,000 GPUs. The deployment results indicate that MuxFlow substantially improves GPU utilization from 26% to 76%, SM activity from 16% to 33%, and GPU memory usage from 42% to 48%.
Keywords: GPU cluster; deep learning workload; cluster management; GPU sharing; deployed system
Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]