Authors: ZHOU Xian-wei; BAO Ming-hao; YE Xin; YU Song-sen (School of Software, South China Normal University, Foshan 528000, China)
Source: Computer Technology and Development, 2023, No. 10, pp. 101-108 (8 pages)
Funding: Major Special Project for Applied Science and Technology Research and Development of Guangdong Province (2016B020244003); Guangdong Basic and Applied Basic Research Foundation (2020B1515120089, 2020A1515110783); Guangdong Province Enterprise Science and Technology Commissioner Project (GDKTP2020014000).
Abstract: Conventional deep reinforcement learning models are trained from scratch with a randomly initialized policy, so in the early stage of training the agent explores inefficiently, learns little from each sample, and the network converges slowly; this stage is known as the cold start process. To address the cold start problem, most current work adopts a two-stage deep reinforcement learning training scheme. However, after such an agent transitions from imitation learning to the deep reinforcement learning stage, it may forget the demonstrated actions, which manifests as an abrupt drop in performance and return. This paper therefore proposes a two-stage TD3 deep reinforcement learning method with Q-network filtering. First, expert demonstration data are collected and the Actor and Critic networks are pre-trained, using imitation learning (behavior cloning) for the Actor and the TD3 Q-network update formula for the Critic. Further, to prevent the pre-trained Actor from mistakenly selecting overvalued actions outside the demonstration dataset during policy gradient updates, and thereby forgetting the demonstrated actions, a Q-network filtering algorithm is proposed: it filters out the inflated value estimates that the pre-trained Critic assigns to actions outside the demonstration dataset, keeping the demonstrated action the highest-valued one and effectively alleviating the forgetting phenomenon. Experiments on the MuJoCo robot simulation platform provided by DeepMind verify the effectiveness of the proposed algorithm.
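The abstract describes the method only at a high level; the PyTorch sketch below illustrates one plausible reading of its three ingredients: behavior-cloning pre-training of the Actor, TD3-style TD pre-training of the Critic, and a "Q-network filtering" penalty that keeps the demonstrated action the argmax of Q(s, ·). The network sizes, the hinge-with-margin form of the filtering loss, the uniform sampling of non-demonstration actions, and all helper names (bc_loss, td_loss, q_filter_loss) are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch, assuming the Q-network filtering is realized as a
# margin-style penalty on actions sampled outside the demonstration set.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N_SAMPLED = 17, 6, 8  # e.g. a MuJoCo locomotion task

actor = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(),
                      nn.Linear(256, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
                       nn.Linear(256, 1))

def bc_loss(states, demo_actions):
    """Stage 1a: behavior cloning -- regress the Actor onto expert actions."""
    return ((actor(states) - demo_actions) ** 2).mean()

def td_loss(states, actions, rewards, next_states, gamma=0.99):
    """Stage 1b: one-step TD target for the Critic, in the spirit of the TD3
    Q-network update (target networks and twin critics omitted for brevity).
    rewards is expected to have shape (batch, 1)."""
    with torch.no_grad():
        next_a = actor(next_states)
        target = rewards + gamma * critic(torch.cat([next_states, next_a], -1))
    return ((critic(torch.cat([states, actions], -1)) - target) ** 2).mean()

def q_filter_loss(states, demo_actions, margin=0.1):
    """Q-network filtering (sketch): sample actions outside the demonstration
    set and penalize the Critic wherever it values them above the demonstrated
    action, so the demo action stays the highest-valued one."""
    q_demo = critic(torch.cat([states, demo_actions], -1))  # (batch, 1)
    penalty = 0.0
    for _ in range(N_SAMPLED):
        rand_a = 2 * torch.rand_like(demo_actions) - 1       # uniform in [-1, 1]
        q_rand = critic(torch.cat([states, rand_a], -1))
        # hinge: only penalize when a non-demo action is (over)valued
        penalty = penalty + torch.relu(q_rand - q_demo + margin).mean()
    return penalty / N_SAMPLED

# One pre-training step over a synthetic demonstration mini-batch:
s = torch.randn(32, STATE_DIM)
a = torch.tanh(torch.randn(32, ACTION_DIM))
loss = bc_loss(s, a) + q_filter_loss(s, a)
loss.backward()
```

Under these assumptions, the filtering term leaves Q(s, a_demo) as an anchor and only pushes down competing out-of-dataset estimates, which matches the abstract's description of suppressing inflated values rather than re-fitting the whole Critic; the paper's actual filtering rule may differ in form.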
Keywords: two-stage deep reinforcement learning; cold start problem; imitation learning; pre-trained network; TD3
Classification: TP391.9 [Automation and Computer Technology - Computer Application Technology]