State-of-the-Art Survey of Cluster Scheduling for Mutable States Data Flow (面向状态可变数据流的集群调度综述)

Cited by: 1

Authors: XU Yuan-Jia (许源佳); WU Heng (吴恒) [2]; YANG Chen (杨晨); WU Yue-Wen (吴悦文); ZHANG Wen-Bo (张文博) [2,3]; WANG Tao (王焘) [2,3]

Affiliations: [1] University of Chinese Academy of Sciences, Beijing 100190; [2] Software Engineering Technology Research and Development Center, Institute of Software, Chinese Academy of Sciences, Beijing 100190; [3] State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190

Source: Chinese Journal of Computers (《计算机学报》), 2022, No. 5, pp. 973-992 (20 pages)

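Funding: National Key Research and Development Program of China (2018YFB1003602); National Natural Science Foundation of China (61872344); Beijing Natural Science Foundation (4182070); Youth Innovation Promotion Association of the Chinese Academy of Sciences talent program (2018144); Alibaba Innovative Research (AIR) Program, 2018.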

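Abstract: Mutable States Data Flow (MS-DF) is the main runtime characteristic of machine learning systems. An MS-DF can be represented as a directed graph whose vertices are operators that express machine learning computation logic and whose edges represent the input-output dependencies between operators. Cluster scheduling of MS-DF is central to keeping machine learning systems running efficiently, and how to schedule MS-DF efficiently has become a research hotspot in machine learning. Machine learning systems (TensorFlow, PyTorch, etc.) act as a middle layer that decouples machine learning computation logic from resource allocation (CPU, GPU, FPGA), so machine learning workloads no longer need to bind resources statically and exclusively; instead, resources are managed dynamically by the machine learning system at runtime. Operators are the key element of this decoupling, which brings new challenges to cluster scheduling of MS-DF. These challenges stem from three aspects: the accuracy of operator resource requirement profiling, the adaptability of operator scheduling decisions, and the variability of operator scheduling adjustments. This survey first introduces two mechanisms for operator resource requirements, perception and cooperation, which address the difficulty of accurately profiling resource requirements when many kinds of operators are combined. It then presents operator scheduling decisions in terms of decision constraints, decision models, and decision solving, to address the adaptability challenge caused by frequent changes in operator state. Next, it introduces operator scheduling adjustment strategies such as migration, scaling, and suspend-resume, which suit the variability challenge arising from different operator state synchronization methods. Finally, based on these three challenges, it summarizes and analyzes recent research on cluster scheduling and looks ahead to MS-DF cluster scheduling, identifying multi-level analysis and cooperative profiling of operators' heterogeneous resource requirements, flexible definition and discovery of complex operator scheduling constraints, and learning-driven low-cost operator scheduling adjustment as the main directions for future development.

Mutable States Data Flow (MS-DF), a main runtime feature of machine learning systems (e.g., TensorFlow, PyTorch, MXNet), can be represented by a directed graph. Each vertex in an MS-DF graph denotes a single operation (e.g., Conv2D, MatMul) that carries out a typical machine learning computation, and each edge connecting two operations denotes a dependency between them: the output of one operation is the input of the other. Cluster scheduling for MS-DF is one of the main means of guaranteeing the execution efficiency of machine learning systems, and it is a hot research topic in the machine learning systems area. Machine learning systems are key factors in the performance of cluster scheduling, because they work as a middle layer that decouples machine learning computation from the allocation of cluster resources (e.g., CPU, GPU, and FPGA). In this way, cluster resources are no longer exclusively and statically bound to one machine learning computation. Instead, machine learning systems can manage different kinds of resources dynamically, at the cost of increased complexity in cluster scheduling for MS-DF. Under these circumstances, machine learning operations can heavily affect the dynamic management of cluster resources, and new challenges arise. We show that these challenges stem from three aspects: (1) the accuracy of profiling operations' heterogeneous resource requirements; (2) the adaptability of operation scheduling decisions; and (3) the variability of operation scheduling adjustments. Around these challenges, we analyze and summarize recent research on cluster scheduling for MS-DF and how it copes with them. (1) We introduce the mechanisms of machine learning operation perception and cooperation that can be used to profile operations' heterogeneous resource requirements; such mechanisms can estimate the actual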

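Keywords: machine learning system; mutable states data flow; machine learning operator; operator resource requirement profiling; operator scheduling decision; operator scheduling adjustment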

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
