Proximal Policy Optimization Algorithm Based on Clipping Optimization and Policy Guidance

Authors: ZHOU Yi; GAO Hua; TIAN Yongshen (School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei 430081, China)

Affiliation: [1] School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan 430081, China

Source: Journal of Computer Applications (《计算机应用》), 2024, No. 8, pp. 2334-2341 (8 pages)

Funding: National Natural Science Foundation of China (62372343).

Abstract: The Proximal Policy Optimization (PPO) algorithm has two weaknesses: it struggles to strictly constrain the difference between the old and new policies, and its exploration and exploitation efficiency is relatively low. To address both, a PPO algorithm based on Clipping Optimization And Policy Guidance (COAPG-PPO) is proposed. First, based on an analysis of PPO's clipping mechanism, a trust-region clipping scheme using the Wasserstein distance is designed, which strengthens the constraint on the difference between the old and new policies. Second, ideas from simulated annealing and greedy algorithms are incorporated into the policy update process, improving the algorithm's exploration efficiency and learning speed. To validate the proposed algorithm, COAPG-PPO is compared on the MuJoCo benchmark with CO-PPO (PPO based on Clipping Optimization), PPO-CMA (PPO with Covariance Matrix Adaptation), TR-PPO-RB (Trust Region-based PPO with RollBack), and PPO. The experimental results indicate that, in most environments, COAPG-PPO shows stricter constraint capability, higher exploration and exploitation efficiency, and higher reward values.
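The abstract contrasts PPO's ratio-based clipping with a trust-region clipping scheme driven by the Wasserstein distance. The record does not give the paper's formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes diagonal Gaussian policies (for which the 2-Wasserstein distance has a closed form) and a hypothetical trigger rule: a sample stops contributing gradient once the policies are at least `delta` apart and the update would push the surrogate further. The function names, `delta`, and `eps` are illustrative assumptions.

```python
import numpy as np


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate, evaluated per sample."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)


def w2_diag_gaussian(mu_old, std_old, mu_new, std_new):
    """Closed-form 2-Wasserstein distance between diagonal Gaussians.

    For N(mu_old, diag(std_old^2)) and N(mu_new, diag(std_new^2)):
    W2^2 = ||mu_old - mu_new||^2 + ||std_old - std_new||^2.
    """
    sq = np.sum((mu_old - mu_new) ** 2, axis=-1)
    sq = sq + np.sum((std_old - std_new) ** 2, axis=-1)
    return np.sqrt(sq)


def wasserstein_trust_region_objective(ratio, advantage, w_dist, delta=0.05):
    """Hypothetical trust-region clipping (an assumption, not the paper's rule).

    A sample is "outside" when the policies are at least `delta` apart AND
    the surrogate is already no worse than at the old policy (ratio = 1).
    Outside samples get the constant value 1 * advantage, so they would
    carry no gradient with respect to the new policy's parameters.
    Returns the per-sample surrogate and a mask of samples still active.
    """
    outside = (w_dist >= delta) & (ratio * advantage >= advantage)
    surrogate = np.where(outside, advantage, ratio * advantage)
    return surrogate, ~outside


# Illustrative inputs: three state-action samples, two action dimensions.
ratio = np.array([1.5, 0.9, 1.1])
advantage = np.array([1.0, -1.0, 2.0])
mu_old, std_old = np.zeros((3, 2)), np.ones((3, 2))
mu_new = np.array([[0.3, 0.4], [0.0, 0.0], [0.0, 0.0]])
std_new = np.ones((3, 2))

w_dist = w2_diag_gaussian(mu_old, std_old, mu_new, std_new)
surrogate, active = wasserstein_trust_region_objective(ratio, advantage, w_dist)
print(ppo_clip_objective(ratio, advantage), w_dist, surrogate, active)
```

The design point the sketch tries to illustrate is that a ratio clip bounds only the probability ratio at the sampled action, whereas a distance-based trigger bounds the divergence between the two action distributions at that state.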

Keywords: deep reinforcement learning; proximal policy optimization; trust region constraint; simulated annealing; greedy algorithm

CLC number: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]

 
