Authors: ZHOU Xian-wei; BAO Ming-hao; YE Xin; YU Song-sen (School of Software, South China Normal University, Foshan 528000, China)
Source: Computer Technology and Development, 2023, No. 10, pp. 101-108 (8 pages)
Funding: Major Special Project for Applied Science and Technology Research and Development of Guangdong Province (2016B020244003); Guangdong Basic and Applied Basic Research Foundation (2020B1515120089, 2020A1515110783); Guangdong Province Enterprise Science and Technology Commissioner Project (GDKTP2020014000).
Abstract: Conventional deep reinforcement learning models are trained from scratch with a randomly initialized policy, so in the early stage of training the agent explores inefficiently, learns little from each sample, and the network converges slowly; this stage is known as the cold start process. To address the cold start problem, most current work adopts a two-stage deep reinforcement learning training scheme. However, after such an agent transitions from imitation learning to the deep reinforcement learning stage, it may forget the demonstrated actions, which manifests as an abrupt drop in performance and return. This paper therefore proposes a two-stage TD3 deep reinforcement learning method with Q-network filtering. First, expert demonstration data are collected and the Actor and Critic networks are pre-trained, using imitation learning (behavior cloning) for the Actor and the TD3 Q-network update formula for the Critic. Further, to prevent the pre-trained Actor from mistakenly selecting overvalued actions outside the demonstration dataset during policy gradient updates, and thereby forgetting the demonstrated actions, a Q-network filtering algorithm is proposed: it filters out the inflated value estimates that the pre-trained Critic assigns to actions outside the demonstration dataset, keeping the demonstrated action the highest-valued one and effectively alleviating the forgetting phenomenon. Experiments on the MuJoCo robot simulation platform provided by DeepMind verify the effectiveness of the proposed algorithm.
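The abstract describes the method only at a high level; the PyTorch sketch below illustrates one plausible reading of its three ingredients: behavior-cloning pre-training of the Actor, TD3-style TD pre-training of the Critic, and a "Q-network filtering" penalty that keeps the demonstrated action the argmax of Q(s, ·). The network sizes, the hinge-with-margin form of the filtering loss, the uniform sampling of non-demonstration actions, and all helper names (bc_loss, td_loss, q_filter_loss) are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch, assuming the Q-network filtering is realized as a
# margin-style penalty on actions sampled outside the demonstration set.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N_SAMPLED = 17, 6, 8  # e.g. a MuJoCo locomotion task

actor = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(),
                      nn.Linear(256, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
                       nn.Linear(256, 1))

def bc_loss(states, demo_actions):
    """Stage 1a: behavior cloning -- regress the Actor onto expert actions."""
    return ((actor(states) - demo_actions) ** 2).mean()

def td_loss(states, actions, rewards, next_states, gamma=0.99):
    """Stage 1b: one-step TD target for the Critic, in the spirit of the TD3
    Q-network update (target networks and twin critics omitted for brevity).
    rewards is expected to have shape (batch, 1)."""
    with torch.no_grad():
        next_a = actor(next_states)
        target = rewards + gamma * critic(torch.cat([next_states, next_a], -1))
    return ((critic(torch.cat([states, actions], -1)) - target) ** 2).mean()

def q_filter_loss(states, demo_actions, margin=0.1):
    """Q-network filtering (sketch): sample actions outside the demonstration
    set and penalize the Critic wherever it values them above the demonstrated
    action, so the demo action stays the highest-valued one."""
    q_demo = critic(torch.cat([states, demo_actions], -1))  # (batch, 1)
    penalty = 0.0
    for _ in range(N_SAMPLED):
        rand_a = 2 * torch.rand_like(demo_actions) - 1       # uniform in [-1, 1]
        q_rand = critic(torch.cat([states, rand_a], -1))
        # hinge: only penalize when a non-demo action is (over)valued
        penalty = penalty + torch.relu(q_rand - q_demo + margin).mean()
    return penalty / N_SAMPLED

# One pre-training step over a synthetic demonstration mini-batch:
s = torch.randn(32, STATE_DIM)
a = torch.tanh(torch.randn(32, ACTION_DIM))
loss = bc_loss(s, a) + q_filter_loss(s, a)
loss.backward()
```

Under these assumptions, the filtering term leaves Q(s, a_demo) as an anchor and only pushes down competing out-of-dataset estimates, which matches the abstract's description of suppressing inflated values rather than re-fitting the whole Critic; the paper's actual filtering rule may differ in form.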
Keywords: two-stage deep reinforcement learning; cold start problem; imitation learning; pre-trained network; TD3
Classification: TP391.9 [Automation and Computer Technology - Computer Application Technology]