Attention-based Recurrent PPO Algorithm and Its Application

Authors: LYU Xiang-lin, ZANG Zhao-xiang [1,2], LI Si-bo, WANG Jun-ying [1,2]

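Affiliations: [1] Hubei Key Laboratory of Intelligent Visual Monitoring for Hydropower Engineering, Three Gorges University, Yichang 443002, Hubei, China; [2] School of Computer and Information, Three Gorges University, Yichang 443002, Hubei, China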

Source: Computer Technology and Development (《计算机技术与发展》), 2024, No. 1, pp. 136-142 (7 pages)

Funding: National Natural Science Foundation of China (61502274); Natural Science Foundation of Hubei Province (2015CFB336).

Abstract: Deep reinforcement learning algorithms in partially observable environments suffer from insufficient information about the environment and from random factors. To address these problems, this paper proposes ARPPO, a proximal policy optimization (PPO) algorithm that combines an attention mechanism with a recurrent neural network. The algorithm first extracts features from the encoded environment image through convolutional layers; it then applies an attention mechanism to highlight the key information in the state; next, an LSTM network extracts the temporal characteristics of the data; finally, PPO with an Actor-Critic structure performs policy learning and training. Ablation and comparative experiments on two exploration tasks were designed in the Gym-Minigrid environment. The results show that ARPPO converges faster than the existing A2C, PPO, and RPPO algorithms, remains highly stable after convergence, and adapts better to unknown environments with random factors.
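The final stage named in the abstract is PPO policy learning. As a rough illustration only, the standard PPO clipped surrogate objective can be sketched in plain Python as below. This is the generic textbook form, not code from the paper: the function name `ppo_clip_objective`, its parameters, and the default `clip_eps=0.2` are assumptions made for this example, and the abstract does not state the hyperparameters the authors used.

```python
import math

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective over a batch (to be maximized).

    All names and the clip_eps default are illustrative assumptions,
    not values taken from the paper.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the data-collecting policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to the interval [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Taking the minimum of the two terms removes any incentive to move the probability ratio far outside the clipping interval in a single update, which is the mechanism PPO relies on for stable training. In a recurrent variant such as the one the abstract describes, the log-probabilities would come from a policy conditioned on the LSTM state, but the objective itself has the same form.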

Keywords: deep reinforcement learning; partially observable; attention mechanism; LSTM network; proximal policy optimization algorithm

Classification code: TP242.6 [Automation and Computer Technology: Detection Technology and Automation Devices]
