Authors: LIANG Yi, DING Zhen-Xing, ZHAO Yu, LIU Ming-Jie, PAN Yong [3], JIN Yi [3]
Affiliations: [1] Faculty of Information, Beijing University of Technology, Beijing 100124; [2] Institute of Beijing Electro-Mechanical Engineering, Beijing 100074; [3] Beijing Computing Center, Beijing 100094
Source: Chinese Journal of Computers (《计算机学报》), 2022, No. 2, pp. 302-316 (15 pages)
Funding: Supported by the Beijing Natural Science Foundation General Program (4192007) and the National Key R&D Program of China (2017YFC0803300).
Abstract: Meeting the training accuracy requirement of a deep learning model within a limited time while minimizing the resource cost is a major challenge for distributed deep learning systems. Distributed deep learning systems are the engine of large-scale deep learning model training. As both the volume of training datasets and the complexity of training models grow, the resource cost of deep learning model training increases significantly and has become a new concern for these systems. In a distributed deep learning system, the resource allocation determines the number of computing nodes assigned to a parallel model training job, and the batch sizing determines the amount of training data processed by a single training task. Empirical studies demonstrate that, from the perspective of resource cost optimization, the configurations of resource allocation and batch sizing are complexly interdependent. However, existing work ignores this interdependence and treats the two configurations as independent means of optimizing the computational efficiency and the training accuracy respectively; such independent configuration methods therefore struggle to meet both goals of satisfying the training accuracy requirement under the training time constraint and minimizing the resource cost. To address this issue, this paper proposes a collaborative configuration method for resource allocation and batch sizing in distributed deep learning systems. Here, the resource cost is defined as the product of the resource allocation and the training time. The method builds on the observation that the relationship of the resource allocation to the training time and that of the batch sizing to the training accuracy are both monotonic. Accordingly, a training time prediction model and a training accuracy prediction model are first established with the isotonic regression technique. The training time prediction model is a function of the resource allocation and the batch sizing and predicts the elapsed time of one complete training epoch; the training accuracy prediction model is a function of the batch sizing and the total number of training epochs and predicts the final accuracy of the trained model. The two models are then used jointly to solve for the resource allocation and batch sizing configuration that satisfies the training accuracy requirement at the minimum resource cost. The proposed method is evaluated on TensorFlow, a representative distributed deep learning system. Experimental results show that, compared with existing automated methods that configure resource allocation or batch sizing independently, the proposed collaborative method reduces the resource cost by up to 26.89%.
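The two-stage pipeline the abstract describes (fit monotone prediction models with isotonic regression, then search the joint configuration space for the cheapest feasible point) can be illustrated with a short sketch. Everything below is hypothetical, not the authors' implementation: the profiling numbers, the accuracy target, the epoch budget, the deadline, and the simplification that the per-epoch time model depends only on the node count, whereas the paper models it as a function of both resource allocation and batch sizing.

    # Hypothetical sketch of the collaborative resource/batch-size
    # configuration idea, using scikit-learn's isotonic regression.
    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    # Hypothetical profiled measurements: per-epoch time falls
    # monotonically with the node count, so fit a decreasing isotonic fit.
    nodes = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
    epoch_time_s = np.array([620.0, 340.0, 190.0, 115.0, 75.0])
    time_model = IsotonicRegression(increasing=False, out_of_bounds="clip")
    time_model.fit(nodes, epoch_time_s)

    # Hypothetical final accuracy falls monotonically with the global
    # batch size (for a fixed epoch budget): another decreasing fit.
    batch_sizes = np.array([64.0, 128.0, 256.0, 512.0, 1024.0])
    final_acc = np.array([0.940, 0.935, 0.921, 0.902, 0.874])
    acc_model = IsotonicRegression(increasing=False, out_of_bounds="clip")
    acc_model.fit(batch_sizes, final_acc)

    ACC_TARGET = 0.92      # assumed accuracy requirement
    EPOCHS = 50            # assumed fixed training budget
    TIME_LIMIT_S = 8000.0  # assumed wall-clock deadline

    # Collaborative search: resource cost = node count x total training
    # time (the abstract's definition). Keep the cheapest configuration
    # that meets both the deadline and the accuracy requirement.
    best = None  # (cost, nodes, batch)
    for n in nodes:
        t_total = EPOCHS * float(time_model.predict([n])[0])
        if t_total > TIME_LIMIT_S:
            continue  # too few nodes: misses the training deadline
        for b in batch_sizes:
            if float(acc_model.predict([b])[0]) < ACC_TARGET:
                continue  # batch too large: cannot reach required accuracy
            cost = n * t_total
            if best is None or cost < best[0]:
                best = (cost, int(n), int(b))

    if best is None:
        print("no configuration meets both the deadline and accuracy target")
    else:
        print(f"cheapest feasible config: {best[1]} node(s), "
              f"batch size {best[2]}, cost {best[0]:.0f} node-seconds")

The increasing=False constraint encodes exactly the monotonicity the abstract relies on: within the profiled range, adding nodes never lengthens an epoch, and enlarging the batch never raises the achievable final accuracy, which is what makes isotonic regression a natural fit for both prediction models.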
Keywords: distributed deep learning system; model training; batch sizing; resource allocation; resource cost
Classification: TP18 [Automation and Computer Technology - Control Theory and Control Engineering]