Affiliation: [1] National Key Laboratory of Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China
Source: Chinese Journal of Computers, 2023, No. 2, pp. 229-243 (15 pages)
Funding: Supported by the National Natural Science Foundation of China (61732018, 61902415, 61972409) and the Key Laboratory Open Fund (WDZC20205500104).
Abstract: In recent years, deep reinforcement learning (DRL) has become a research hot spot in artificial intelligence, achieving great success in many complex environments and even outperforming humans in several difficult games. To accelerate DRL training, researchers have proposed distributed reinforcement learning methods that improve training speed and scalability. Existing distributed reinforcement learning methods fall into three categories: on-policy, off-policy, and the more recent near on-policy methods. Near on-policy methods alleviate the problems of both on-policy and off-policy training, but because they are built on a shared-memory parallel model, they are difficult to extend to compute clusters connected by a network. This low scalability limits the amount of hardware resources near on-policy methods can exploit and increases the load on each compute node, which ultimately lengthens training time. To improve the scalability of near on-policy training and speed up convergence, this paper proposes PALA (Parallel Actor-Learner Architecture), a message-passing training framework that combines a Gossip algorithm with model averaging; it accelerates convergence by increasing the parallelism and scalability of training. First, using the Gossip algorithm as the communication substrate, together with a global data proxy and a message-passing parallel model, the framework provides a scalable scheme for training a single agent with multiple parallel workers, able to span multiple network-connected nodes. Second, to keep exploration and exploitation on-policy (that is, to keep the policy gap between actors and learners within a small bound) and to stabilize training, we design a process lock that provides implicit synchronization across machines; without such implicit synchronization, parallel training yields little benefit. Third, for model data containing CUDA tensors, we propose a serialization method that allows model data to be transmitted and aggregated over the inter-node network. Finally, we use model aggregation to further accelerate training. With these optimizations, PALA maps the workload evenly across the whole cluster, reduces the long waits caused by overloaded nodes, and speeds up convergence. Experiments show that, compared with the previous shared-memory approach, PALA reduces training time by more than 20% for agents reaching the same performance level; it also scales to more than six times the hardware resources of the original method, and the final policies it learns reach the best level in almost all test environments.
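The record describes PALA's Gossip-based communication only at a high level. As a rough illustration (a sketch, not PALA's actual implementation), the snippet below shows one gossip round of peer-to-peer model averaging using PyTorch's point-to-point primitives; it assumes dist.init_process_group has already been called, that the chosen peer performs the matching exchange in the same round, and the helper name gossip_average, the peer list, and the mixing weight are all invented for illustration.

```python
import random
import torch
import torch.distributed as dist

def gossip_average(model: torch.nn.Module, peers: list[int], mixing: float = 0.5) -> None:
    """One gossip round: exchange parameters with a single random peer and
    take a convex combination. Illustrative only; PALA's real protocol,
    data proxy, and process-lock synchronization are more involved."""
    peer = random.choice(peers)
    for p in model.parameters():
        remote = torch.empty_like(p.data)
        req = dist.isend(p.data, dst=peer)   # non-blocking send of our copy
        dist.recv(remote, src=peer)          # blocking receive of the peer's copy
        req.wait()
        # Drift toward the peer's parameters; repeated random rounds
        # drive all nodes toward consensus without a central server.
        p.data.mul_(1.0 - mixing).add_(remote, alpha=mixing)
```

Repeated rounds of such pairwise averaging are what lets a gossip scheme spread model updates across a cluster without the all-to-all synchronization that limits shared-memory designs.

The abstract also mentions a serialization method for model data containing CUDA tensors. One common way to achieve this (again a hedged sketch, not necessarily the paper's method; serialize_model and deserialize_model are hypothetical names) is to move the state dict to host memory before pickling, since a CUDA tensor cannot be deserialized on a peer that lacks the matching device context:

```python
import io
import pickle
import torch

def serialize_model(model: torch.nn.Module) -> bytes:
    """Detach parameters from the GPU so the payload can cross the
    node-to-node network as plain bytes (torch.save would work similarly)."""
    cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    buf = io.BytesIO()
    pickle.dump(cpu_state, buf)
    return buf.getvalue()

def deserialize_model(payload: bytes, device: str = "cuda") -> dict:
    """Restore the state dict and move tensors back onto the local GPU,
    ready to be averaged into the receiver's model."""
    state = pickle.loads(payload)
    return {k: v.to(device) for k, v in state.items()}
```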
Keywords: Gossip algorithm; reinforcement learning; on-policy learning; distributed reinforcement learning; parallel training
Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]