Authors: WANG Hui [1]; LI Hong [1]; HE Qiu-sheng [1]; LI Zhan-long [2]
Affiliations: [1] School of Electronic and Information Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, Shanxi, China; [2] School of Vehicle and Traffic Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, Shanxi, China
Source: Computer Simulation (《计算机仿真》), 2025, No. 3, pp. 404-409, 436 (7 pages)
Funding: National Natural Science Foundation of China (52272401).
Abstract: An improved proximal policy optimization (PPO) penalty algorithm is proposed to address the poor convergence of the traditional PPO penalty algorithm during training. By replacing the constant-based adaptive update of the penalty coefficient with a function-based adaptive update, the penalty coefficient is coupled to the divergence and varies with it in a defined trend, which improves the convergence and learning reliability of the algorithm and makes it more flexible and adaptable. Simulation results show that the improved PPO penalty algorithm outperforms the traditional PPO penalty algorithm in terms of convergence and learning reliability, and a distributed PPO algorithm is used to further verify the effectiveness of the improvement, providing a new idea and method for subsequent research on reinforcement learning algorithms.
Classification: TP301.6 [Automation and Computer Technology - Computer System Architecture]
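The abstract contrasts a constant-based and a function-based adaptive update of the PPO penalty coefficient. Below is a minimal Python sketch of the two update rules, assuming the standard KL-penalty PPO setup; the exponential functional form, the KL target value, and the clipping bounds are illustrative assumptions, since the record does not give the paper's exact formula.

import math

KL_TARGET = 0.01  # assumed target KL divergence per policy update

def update_beta_constant(beta, kl, kl_target=KL_TARGET):
    # Conventional PPO-penalty rule: scale the coefficient by fixed
    # constants when the measured KL leaves a band around the target.
    if kl > 1.5 * kl_target:
        return beta * 2.0
    if kl < kl_target / 1.5:
        return beta * 0.5
    return beta

def update_beta_function(beta, kl, kl_target=KL_TARGET):
    # Function-based rule sketched from the abstract: beta is coupled
    # to the measured divergence and varies continuously with it.
    # The exponential form and the bounds below are assumptions, not
    # the paper's exact formula.
    ratio = max(min(kl / kl_target - 1.0, 1.0), -1.0)
    new_beta = beta * math.exp(ratio)
    return max(min(new_beta, 1e2), 1e-4)

# Example: per training iteration the penalized surrogate objective is
#   L(theta) = E[ r_t(theta) * A_t ] - beta * KL(pi_old || pi_theta)
# and beta is refreshed from the KL measured on the latest batch.
beta = 1.0
for kl in (0.002, 0.008, 0.03, 0.05):
    beta = update_beta_function(beta, kl)
    print(f"measured KL = {kl:.3f} -> beta = {beta:.4f}")

With the constant rule, beta jumps by factors of 2 only when the KL crosses the band edges; with the function-based rule it drifts smoothly with the divergence, which is the flexibility and adaptability the abstract attributes to the improved method.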