Constructing and analyzing deep learning task dataset for R&D GPU clusters

Authors: LUO Jing (罗婧) [1,2]; YE Zhi-sheng (叶志晟); YANG Ze-hua (杨泽华); FU Tian-hao (傅天豪); WEI Xiong (魏雄); WANG Xiao-lin (汪小林) [2,3]; LUO Ying-wei (罗英伟)

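Affiliations: [1] School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan 430200, Hubei, China; [2] Pengcheng Laboratory, Shenzhen 518000, Guangdong, China; [3] School of Computer Science, Peking University, Beijing 100871, China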

Source: Computer Engineering & Science (《计算机工程与科学》), 2024, No. 12, pp. 2128-2137 (10 pages)

Funding: National Natural Science Foundation of China (62032001, 62032008).

Abstract: In recent years, as demand for training deep learning models has grown, research institutions and enterprises have built shared GPU clusters to reduce costs and improve efficiency. Existing research focuses mainly on task scheduling and resource allocation in enterprise production GPU clusters. This paper instead targets Pengcheng Cloud Brain I, an R&D GPU cluster. By monitoring and collecting key metrics during task runtime, it constructs a dataset of deep learning training tasks, the Pengcheng Cloud Brain I Task Dataset, which includes fine-grained time-series resource usage information for tasks. This is the first publicly available dataset for R&D GPU clusters. It reveals low resource utilization in R&D GPU clusters and provides a basis and reference for designing schedulers that achieve high resource utilization on such clusters, thereby promoting research on task scheduling and resource allocation mechanisms.

Keywords: GPU cluster; deep learning; cluster workload; task dataset; resource utilization

Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
