Authors: ZHANG Jian-hang; LIU Quan [1,2,3,4]
Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; [2] Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China; [3] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; [4] Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
Source: Computer Science, 2021, No. 10, pp. 37-43 (7 pages)
Funding: National Natural Science Foundation of China (61772355, 61702055, 61502323, 61502329); Major Program of Natural Science Research of Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Application Foundation Research Program, Industrial Part (SYG201422); Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Abstract: Continuous control in reinforcement learning has been a hot research topic in recent years. The deep deterministic policy gradient (DDPG) algorithm performs well in continuous control tasks. DDPG trains its network model with an experience replay mechanism; to further improve the efficiency of experience replay in DDPG, the cumulative episode return is used as the basis for classifying samples, and a deep deterministic policy gradient with episode experience replay (EER-DDPG) algorithm is proposed. First, transitions are stored in units of whole episodes, and two replay buffers are used to store episodes separately according to the size of their cumulative return. Then, during network model training, episodes with large cumulative return are sampled preferentially to improve the quality of training. The method is verified experimentally on continuous control tasks and compared with DDPG using uniform random sampling, trust region policy optimization (TRPO), and proximal policy optimization (PPO). The experimental results show that EER-DDPG achieves better performance.
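The abstract describes storing transitions by whole episode, splitting episodes across two replay buffers according to cumulative return, and sampling preferentially from the high-return buffer during training. As the full paper is not reproduced here, the sketch below is only a minimal illustration of that classify-and-sample idea: the class name EpisodeReplayBuffer, the fixed threshold argument, and the 0.7 sampling bias are illustrative assumptions, not values taken from the paper.

    # Minimal sketch of episode-classified experience replay (assumed details noted above).
    import random
    from collections import deque

    class EpisodeReplayBuffer:
        """Stores whole episodes, split into two pools by cumulative return."""

        def __init__(self, capacity=1000, high_sample_ratio=0.7):
            self.high = deque(maxlen=capacity)   # episodes with large cumulative return
            self.low = deque(maxlen=capacity)    # remaining episodes
            self.high_sample_ratio = high_sample_ratio  # assumed bias toward good episodes

        def add_episode(self, transitions, threshold):
            """transitions: list of (state, action, reward, next_state, done) tuples."""
            episode_return = sum(t[2] for t in transitions)  # cumulative reward of the episode
            pool = self.high if episode_return >= threshold else self.low
            pool.append(transitions)

        def sample(self, batch_size):
            """Draw a minibatch of transitions, biased toward high-return episodes."""
            if not self.high and not self.low:
                raise ValueError("no episodes stored yet")
            batch = []
            for _ in range(batch_size):
                use_high = bool(self.high) and (not self.low or random.random() < self.high_sample_ratio)
                episode = random.choice(self.high if use_high else self.low)
                batch.append(random.choice(episode))
            return batch

    # Usage sketch with hypothetical numbers: classify a finished episode against a
    # chosen return threshold, then sample a small minibatch for a DDPG update.
    buffer = EpisodeReplayBuffer()
    episode = [([0.0], [0.1], 1.0, [0.1], False), ([0.1], [0.2], 2.0, [0.3], True)]
    buffer.add_episode(episode, threshold=1.5)
    minibatch = buffer.sample(batch_size=2)

How the threshold is set (a fixed value, a running mean of past returns, or something else) is a design choice the abstract does not specify; the fixed value above is purely for illustration.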
Keywords: deep deterministic policy gradient; continuous control task; experience replay; cumulative return; classified experience replay
Classification Code: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]