Authors: CHENG Yu-hu (程玉虎), AN Bing-qing (安冰清), KONG Yi (孔毅) (School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China)
Affiliations: [1] School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China
Source: Control and Decision (《控制与决策》), 2025, No. 3, pp. 1015-1023 (9 pages)
Funding: National Natural Science Foundation of China (62176259, 62006232).
Abstract: Correcting value function estimation bias has become an important research direction in deep reinforcement learning. Most existing work focuses on alleviating overestimation bias, but ignores the underestimation bias introduced in the process of mitigating it. To this end, this paper flexibly sets up multiple Actor and Critic networks within the Actor-Critic framework to alleviate value function underestimation bias, and proposes a delayed deep deterministic policy gradient algorithm based on combined network optimization (D3PG-CNO). The main idea of D3PG-CNO is as follows: during experience collection, a single Critic network evaluates the output actions of multiple Actor networks, and the best-scoring action is stored in the experience pool; during training, the Critic network with the smallest estimate at the current state-action pair is selected from the multiple Critic networks and used to evaluate the output actions of the multiple Actor networks, and the maximum of these evaluations is used to compute the target value. Experimental results on the MuJoCo platform show that, compared with existing deterministic policy gradient algorithms, D3PG-CNO significantly reduces estimation bias, improves stability and convergence speed, and achieves better performance across multiple tasks.
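The abstract specifies the two selection rules but not the network architectures, the number of networks, or any hyperparameters. The PyTorch sketch below illustrates the two rules under stated assumptions: MLP actors and critics of illustrative sizes, 3 actors and 2 critics, and per-sample critic selection (the abstract does not say whether the minimizing critic is chosen per sample or per batch). It is one reading of the abstract, not the paper's implementation.

```python
# Minimal sketch of the two D3PG-CNO selection rules described in the
# abstract. All network sizes and counts are illustrative assumptions.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh())
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def select_behavior_action(state, actors, critic):
    """Experience collection: a single critic scores every actor's
    proposed action; the highest-scoring action goes to the buffer."""
    with torch.no_grad():
        candidates = torch.stack([a(state) for a in actors])     # (N, B, act)
        q = torch.stack([critic(state, c) for c in candidates])  # (N, B, 1)
        best = q.squeeze(-1).argmax(dim=0)                       # (B,)
        return candidates[best, torch.arange(state.shape[0])]    # (B, act)

def compute_target(state, action, reward, next_state, done,
                   target_actors, target_critics, gamma=0.99):
    """Target computation: per sample, pick the critic with the smallest
    estimate at the current (s, a) pair, let that critic score every
    target actor's next action, and keep the maximum score."""
    with torch.no_grad():
        # (M, B): each target critic's estimate at the current pair
        q_sa = torch.stack([c(state, action)
                            for c in target_critics]).squeeze(-1)
        j = q_sa.argmin(dim=0)                                   # (B,)
        # (M, B): per critic, max over the N actors' next actions
        next_actions = [a(next_state) for a in target_actors]
        q_next = torch.stack([
            torch.stack([c(next_state, na) for na in next_actions])
                 .squeeze(-1).max(dim=0).values
            for c in target_critics])
        q_sel = q_next.gather(0, j.unsqueeze(0)).squeeze(0)      # (B,)
        return reward + gamma * (1.0 - done) * q_sel

# Smoke test with random tensors (dimensions are arbitrary).
if __name__ == "__main__":
    S, A, B = 8, 2, 4
    actors = [Actor(S, A) for _ in range(3)]
    critics = [Critic(S, A) for _ in range(2)]
    s = torch.randn(B, S)
    a = select_behavior_action(s, actors, critics[0])
    y = compute_target(s, a, torch.randn(B), torch.randn(B, S),
                       torch.zeros(B), actors, critics)
    print(a.shape, y.shape)  # torch.Size([4, 2]) torch.Size([4])
```

The intuition, following the abstract: taking the minimum over critics suppresses overestimation but introduces underestimation, and letting the minimizing critic then take the maximum over several actors' proposals pushes the target back up, counteracting that underestimation.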
Keywords: deep reinforcement learning; underestimation bias; deterministic policy gradient; Actor-Critic framework; value function
Classification: TP18 [Automation and Computer Technology / Control Theory and Control Engineering]