Authors: CHENG Yu-hu (程玉虎), AN Bing-qing (安冰清), KONG Yi (孔毅) (School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China)
Affiliations: [1] School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China
Source: Control and Decision (《控制与决策》), 2025, No. 3, pp. 1015-1023 (9 pages)
Funding: National Natural Science Foundation of China (62176259, 62006232).
Abstract: Correcting value function estimation bias has become an important research direction in deep reinforcement learning. Most existing work focuses on alleviating overestimation bias, but ignores the underestimation bias introduced in the process of mitigating it. To this end, this paper flexibly sets up multiple Actor and Critic networks within the Actor-Critic framework to alleviate value function underestimation bias, and proposes a delayed deep deterministic policy gradient algorithm based on combined network optimization (D3PG-CNO). The main idea of D3PG-CNO is as follows: during experience collection, a single Critic network evaluates the output actions of multiple Actor networks, and the best-scoring action is stored in the experience pool; during training, the Critic network with the smallest estimate at the current state-action pair is selected from the multiple Critic networks and used to evaluate the output actions of the multiple Actor networks, and the maximum of these evaluations is used to compute the target value. Experimental results on the MuJoCo platform show that, compared with existing deterministic policy gradient algorithms, D3PG-CNO significantly reduces estimation bias, improves stability and convergence speed, and achieves better performance across multiple tasks.
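The abstract specifies the two selection rules but not the network architectures, the number of networks, or any hyperparameters. The PyTorch sketch below illustrates the two rules under stated assumptions: MLP actors and critics of illustrative sizes, 3 actors and 2 critics, and per-sample critic selection (the abstract does not say whether the minimizing critic is chosen per sample or per batch). It is one reading of the abstract, not the paper's implementation.

```python
# Minimal sketch of the two D3PG-CNO selection rules described in the
# abstract. All network sizes and counts are illustrative assumptions.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh())
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def select_behavior_action(state, actors, critic):
    """Experience collection: a single critic scores every actor's
    proposed action; the highest-scoring action goes to the buffer."""
    with torch.no_grad():
        candidates = torch.stack([a(state) for a in actors])     # (N, B, act)
        q = torch.stack([critic(state, c) for c in candidates])  # (N, B, 1)
        best = q.squeeze(-1).argmax(dim=0)                       # (B,)
        return candidates[best, torch.arange(state.shape[0])]    # (B, act)

def compute_target(state, action, reward, next_state, done,
                   target_actors, target_critics, gamma=0.99):
    """Target computation: per sample, pick the critic with the smallest
    estimate at the current (s, a) pair, let that critic score every
    target actor's next action, and keep the maximum score."""
    with torch.no_grad():
        # (M, B): each target critic's estimate at the current pair
        q_sa = torch.stack([c(state, action)
                            for c in target_critics]).squeeze(-1)
        j = q_sa.argmin(dim=0)                                   # (B,)
        # (M, B): per critic, max over the N actors' next actions
        next_actions = [a(next_state) for a in target_actors]
        q_next = torch.stack([
            torch.stack([c(next_state, na) for na in next_actions])
                 .squeeze(-1).max(dim=0).values
            for c in target_critics])
        q_sel = q_next.gather(0, j.unsqueeze(0)).squeeze(0)      # (B,)
        return reward + gamma * (1.0 - done) * q_sel

# Smoke test with random tensors (dimensions are arbitrary).
if __name__ == "__main__":
    S, A, B = 8, 2, 4
    actors = [Actor(S, A) for _ in range(3)]
    critics = [Critic(S, A) for _ in range(2)]
    s = torch.randn(B, S)
    a = select_behavior_action(s, actors, critics[0])
    y = compute_target(s, a, torch.randn(B), torch.randn(B, S),
                       torch.zeros(B), actors, critics)
    print(a.shape, y.shape)  # torch.Size([4, 2]) torch.Size([4])
```

The intuition, following the abstract: taking the minimum over critics suppresses overestimation but introduces underestimation, and letting the minimizing critic then take the maximum over several actors' proposals pushes the target back up, counteracting that underestimation.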
Keywords: deep reinforcement learning; underestimation bias; deterministic policy gradient; Actor-Critic framework; value function
Classification: TP18 [Automation and Computer Technology / Control Theory and Control Engineering]