检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:王壮 王平辉[1] 王彬丞 武文博 王斌 丛鹏宇 WANG Zhuang;WANG Pinghui;WANG Bincheng;WU Wenbo;WANG Bin;CONG Pengyu(Ministry of Education Key Lab for Intelligent Networks and Network Security,Xi'an Jiaotong University,Xi'an 710049,China;China Mobile Research Institute,Beijing 100053,China)
机构地区:[1]西安交通大学智能网络与网络安全教育部重点实验室,西安710049 [2]中国移动通信有限公司研究院,北京100053
出 处:《计算机科学》2023年第6期86-91,共6页Computer Science
基 金:国家重点研发计划(2021YFB1715600);国家自然科学基金(61902305,61922067);深圳市基础研究基金(JCYJ20170816100819428);“人工智能”教育部-中国移动建设项目(MCM20190701)。
摘 要:近年来,容器由于具有轻量级以及高可扩展性,逐渐替代了虚拟机,被广泛应用于深度学习云平台中。但目前深度学习云平台在GPU资源管理上依然存在着不足,主要表现为由于容器编排技术的限制,多个容器无法共享使用GPU资源,而对于一些小规模模型的训练任务和推理任务,单个任务并不能充分利用整张GPU卡的计算资源。当前的独占模式会导致昂贵的GPU资源的浪费,降低资源效率和服务可用性。针对这一问题,提出了一种GPU共享调度系统。一方面,基于Kubernetes的Operator机制对现有集群功能进行扩展,实现了多个Pod共享使用GPU资源,同时设计了一种代理机制保证了与原生Kubernetes的兼容性。另一方面,基于GPU时间片与抢占机制,实现了GPU资源的动态管理与调度,在多个任务之间进行细粒度的协调,并减少了任务干扰。实验结果表明,与原生Kubernetes调度系统相比,该系统能够将一组深度学习训练任务的完成时间平均减少约20%,使得集群GPU资源利用率平均提升约10%。在共享使用GPU时高优先级任务性能相较于独占GPU损耗不到5%,同时能够使得低优先级任务以20%的性能运行在同一张GPU上。In recent years,containers have gradually replaced virtual machines and are widely used in deep learning cloud platforms due to their lightweight and high scalability.However,the deep learning cloud platform still has deficiencies in GPU resource management,which are mainly manifested as multiple containers cannot share GPU resources due to the limitation of container orchestration technology.For some small-scale model training tasks and model inference tasks,a single task cannot fully utilize the computing resources of the entire GPU card.The current exclusive mode will result in a waste of expensive GPU resources,reduce resource efficiency and service availability.In response to this problem,this paper proposes a GPU sharing sche-duling system.On the one hand,the Kubernetes-based Operator mechanism extends the existing cluster functions,enabling multiple Pods to share GPU resources,and designs an agent mechanism to ensure that compatibility with native Kubernetes.On the other hand,based on the GPU time slice and preemption mechanism,the dynamic management and scheduling of GPU resources is realized,fine-grained coordination among multiple tasks is performed,and task interference is reduced.Experimental results show that compared with the native Kubernetes scheduling system,the proposed system can reduce the completion time of a group of deep learning training tasks by about 20%on average,and increase the utilization of cluster GPU resources by about 10%on average.When the GPU is shared,the performance loss of high-priority tasks is less than 5%compared to the exclusive GPU,and the low-priority tasks can run on the same GPU with 20%of the performance.
关 键 词:深度学习云平台 GPU共享 容器调度 DOCKER Kubernetes
分 类 号:TP181[自动化与计算机技术—控制理论与控制工程]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:13.58.11.68