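Affiliations: [1] University of Chinese Academy of Sciences, Beijing 100190; [2] Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190; [3] State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190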
Source: Chinese Journal of Computers (《计算机学报》), 2022, No. 5, pp. 973-992 (20 pages)
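Funding: National Key Research and Development Program of China (2018YFB1003602); National Natural Science Foundation of China (61872344); Beijing Natural Science Foundation (4182070); Youth Innovation Promotion Association of the Chinese Academy of Sciences, talent program (2018144); Alibaba Innovative Research (AIR) Program, 2018.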
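Abstract: Mutable States Data Flow (MS-DF) is the main runtime characteristic of machine learning systems. An MS-DF can be represented as a directed graph whose vertices are operations that express machine learning computation logic and whose edges represent the input/output dependencies between operations. Cluster scheduling of MS-DF is the main work that keeps machine learning systems running efficiently, and how to schedule MS-DF efficiently on a cluster has become a hot research topic in machine learning. Machine learning systems (TensorFlow, PyTorch, etc.) act as a middle layer that decouples machine learning computation logic from resource allocation (CPU, GPU, FPGA), so that machine learning no longer needs to bind resources exclusively and statically; instead, resources are managed dynamically by the machine learning system at runtime. Operations are the key element of this decoupling, which brings new challenges to cluster scheduling of MS-DF. These challenges arise from three aspects: the accuracy of profiling operation resource requirements, the adaptability of operation scheduling decisions, and the variability of operation scheduling adjustments. The paper first introduces two mechanisms for operation resource requirements, perception and cooperation, to overcome the difficulty of accurately profiling resource requirements when many kinds of operations are combined. It then presents operation scheduling decisions in terms of decision constraints, decision models, and decision solving, to address the adaptability challenge caused by frequent changes in operation state. Next, it introduces operation scheduling adjustment strategies such as migration, scaling, and suspend/resume, to address the variability challenge caused by different methods of synchronizing operation state. Finally, based on these three challenges, it summarizes and analyzes recent research on cluster scheduling and looks ahead to cluster scheduling of MS-DF, identifying three main directions for future development: multi-level analysis and cooperative profiling of operations' heterogeneous resource requirements, flexible definition and discovery of complex operation scheduling constraints, and learning-driven, low-cost operation scheduling adjustment.
Mutable States Data Flow (MS-DF), a main runtime feature of machine learning systems (e.g. TensorFlow, PyTorch, MXNet), can be represented by a directed graph. Each vertex in an MS-DF graph denotes a single operation (e.g. Conv2D, MatMul) that consists of typical machine learning computing processes. Each edge connecting two operations denotes a dependency between them: the output of one operation is the input of the other. Currently, cluster scheduling for MS-DF is one of the main tasks that guarantee the execution efficiency of machine learning systems, and it is a hot research topic in the machine learning systems area. Machine learning systems are key factors in the performance of cluster scheduling for MS-DF, since they work as a middle layer that decouples machine learning computation from the allocation of cluster resources (e.g. CPU, GPU, and FPGA). In this way, cluster resources are no longer exclusively and statically bound to one machine learning computation. Instead, machine learning systems may manage different kinds of resources dynamically, but at the cost of increased complexity in cluster scheduling for MS-DF. Under these circumstances, machine learning operations can heavily affect the dynamic management of cluster resources, and new challenges arise. We show that these challenges are caused by the following three aspects: (1) the accuracy of profiling operations' heterogeneous resource requirements; (2) the adaptability of operation scheduling decisions; (3) the variability of operation scheduling adjustments. With respect to these challenges, we analyze and summarize recent research on cluster scheduling for MS-DF and how it copes with them. (1) We introduce the mechanisms of machine learning operation perception and cooperation that can be used to profile operations' heterogeneous resource requirements; such mechanisms can estimate the actual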
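Keywords: machine learning system; mutable states data flow; machine learning operation; operation resource requirement profiling; operation scheduling decision; operation scheduling adjustment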
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]