Research on the construction and optimization of heterogeneous distributed deep learning platform

Cited by: 2


Authors: HU Changxiu; ZHANG Yangsen [1,2]; PENG Shuang; CHEN Han [1]; QI Haojia

Affiliations: [1] Institute of Intelligent Information, Beijing Information Science & Technology University, Beijing 100101, China; [2] National Economic Security Early Warning Engineering Beijing Laboratory, Beijing 100044, China; [3] College of Arts, Northeast Normal University, Changchun 130022, China

Source: Journal of Chongqing University of Technology: Natural Science, 2023, No. 9, pp. 208-216 (9 pages)

Funding: Major Project of the National Social Science Fund of China (21&ZD287).

Abstract: Combining deep learning with big data technology is the general trend, but many problems in resource management and task scheduling remain to be solved and optimized. To address three problems (weak management of heterogeneous resources, inflexible native scheduling algorithms, and the lack of a unified interface across multiple frameworks), this paper proposes a platform that integrates distributed deep learning frameworks on heterogeneous resources and studies the optimization of its task scheduling algorithm. Built on the Spark framework, the platform extends and manages heterogeneous resources at the lower layer and integrates two frameworks, SparkOnAngel and TensorFlowOnSpark, at the upper layer. It uses physical labeling to tag machines that carry different computing resources with different labels, and it optimizes the scheduling algorithm by means of a dual representation of the resource model. The results show that, compared with a traditional Spark cluster, the platform reduces execution time by 13.4% in a mixed scenario of 5 minist_spark tasks and 5 WordCount tasks; in a large-batch WordCount scenario, when the number of jobs reaches 60, execution time can be reduced to 32.31%. The platform extends resource management to GPUs, makes the scheduling algorithm more flexible and efficient, and provides a unified calling interface for multiple frameworks.
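The abstract does not give the scheduling algorithm itself, so the following is only a minimal sketch of the general idea of label-based placement: machines carry labels for the resources they mount, and each task is sent to a machine whose labels match. The `Node` and `Task` classes, the `schedule` function, the label names `cpu` and `gpu`, the worker names, and the least-loaded tie-breaking rule are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: set  # hypothetical labels, e.g. {"cpu"} or {"cpu", "gpu"}
    running: list = field(default_factory=list)


@dataclass
class Task:
    name: str
    required_label: str  # label a node must carry to run this task


def schedule(tasks, nodes):
    """Assign each task to the least-loaded node carrying its required label.

    This is an illustrative rule, not the algorithm from the paper.
    """
    placement = {}
    for task in tasks:
        candidates = [n for n in nodes if task.required_label in n.labels]
        if not candidates:
            placement[task.name] = None  # no node with a matching label
            continue
        # min() returns the first node among those with the fewest tasks
        target = min(candidates, key=lambda n: len(n.running))
        target.running.append(task.name)
        placement[task.name] = target.name
    return placement


# Hypothetical cluster: two CPU-only workers and one worker with a GPU.
nodes = [
    Node("worker-1", {"cpu"}),
    Node("worker-2", {"cpu"}),
    Node("worker-3", {"cpu", "gpu"}),
]
# Hypothetical mixed workload: GPU training tasks plus CPU WordCount tasks.
tasks = [Task(f"minist_spark-{i}", "gpu") for i in range(2)] + [
    Task(f"WordCount-{i}", "cpu") for i in range(3)
]
placement = schedule(tasks, nodes)
print(placement)
```

In this sketch the GPU tasks can only land on the GPU-labeled worker, while the CPU tasks spread across whichever matching workers are least busy, which is the kind of flexibility the abstract attributes to labeling.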

Keywords: heterogeneous; scheduling algorithm; resource management; unified interface

Classification No.: TP393 [Automation and Computer Technology: Computer Application Technology]

 
