Proximal Policy Optimization Based on Self-Directed Action Selection  (Cited by: 7)

Authors: SHEN Yi; LIU Quan [1,2,3,4] (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China)

Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; [2] Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China; [3] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; [4] Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China

Source: Computer Science (《计算机科学》), 2021, Issue 12, pp. 297-303 (7 pages)

Funding: National Natural Science Foundation of China (61772355, 61702055, 61502323, 61502329); Major Program of Natural Science Research of Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Applied Basic Research Program, Industrial Part (SYG201422); Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Optimization algorithms with monotonic policy improvement are a current research focus in reinforcement learning and have shown good performance on both discrete and continuous control tasks. Proximal Policy Optimization (PPO) is a classic monotonic policy improvement algorithm, but as an on-policy method its sample utilization is low. To address this problem, an algorithm named Proximal Policy Optimization Based on Self-Directed Action Selection (SDAS-PPO) is proposed. SDAS-PPO not only reuses sample experience according to importance sampling weights, but also adds a synchronously updated experience pool that stores the agent's own excellent sample experiences, and uses a self-directed network learned from this pool to guide action selection. SDAS-PPO greatly improves sample utilization and ensures that the agent learns quickly and effectively while training the network model. To verify the effectiveness of SDAS-PPO, it is compared with the TRPO, PPO and PPO-AMBER algorithms on continuous control tasks in the MuJoCo simulation platform. Experimental results show that the proposed method performs better in the vast majority of environments.
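The abstract describes SDAS-PPO only at a high level, so the following PyTorch sketch is a hypothetical illustration rather than the authors' implementation. It shows the two ingredients the abstract names: a standard PPO clipped surrogate loss driven by importance-sampling ratios, and a separate self-directed network fitted by regression on a pool of high-return ("excellent") transitions, whose output is blended into action selection. The class names, the regression objective, and the guide_weight mixing coefficient are assumptions introduced here for illustration only.

# Hypothetical sketch of PPO with a self-directed guide network (not the paper's code).
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy for continuous control (e.g., MuJoCo tasks)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mu(obs), self.log_std.exp())

class SelfDirectedNet(nn.Module):
    """Maps states to actions; trained on the excellent-experience pool (assumed design)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, act_dim))

    def forward(self, obs):
        return self.net(obs)

def ppo_clip_loss(policy, old_log_prob, obs, act, adv, clip_eps=0.2):
    """Standard PPO clipped surrogate: ratio = pi_theta(a|s) / pi_theta_old(a|s)."""
    log_prob = policy.dist(obs).log_prob(act).sum(-1)
    ratio = (log_prob - old_log_prob).exp()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

def self_directed_loss(guide, pool_obs, pool_act):
    """Regress the guide network onto actions stored in the excellent-experience pool."""
    return ((guide(pool_obs) - pool_act) ** 2).mean()

def select_action(policy, guide, obs, guide_weight=0.3):
    """Blend the sampled policy action with the guide's suggestion (assumed mixing rule)."""
    with torch.no_grad():
        sampled = policy.dist(obs).sample()
        return (1.0 - guide_weight) * sampled + guide_weight * guide(obs)

if __name__ == "__main__":
    # Minimal usage example with arbitrary state/action dimensions.
    policy = GaussianPolicy(obs_dim=11, act_dim=3)
    guide = SelfDirectedNet(obs_dim=11, act_dim=3)
    obs = torch.randn(11)
    action = select_action(policy, guide, obs)
    print(action)

How the excellent-experience pool is filled and synchronized, and how strongly the guide network influences action selection, are design choices the abstract does not specify; the blending rule above is simply one plausible realization.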

Keywords: reinforcement learning; deep reinforcement learning; policy gradient; proximal policy optimization; self-directed

CLC Number: TP181 [Automation and Computer Technology / Control Theory and Control Engineering]

 
