Robust Offline Actor-Critic With On-policy Regularized Policy Evaluation  


Authors: Shuo Cao, Xuesong Wang, Yuhu Cheng

Affiliations: [1] Engineering Research Center of Intelligent Control for Underground Space, Ministry of Education, and School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China; [2] IEEE

Source: IEEE/CAA Journal of Automatica Sinica, 2024, Issue 12, pp. 2497-2511 (15 pages)

Funding: Supported in part by the National Natural Science Foundation of China (62176259, 62373364) and the Key Research and Development Program of Jiangsu Province (BE2022095).

Abstract: To alleviate the extrapolation error and instability inherent in the Q-function learned directly by off-policy Q-learning (QL-style) on static datasets, this article utilizes on-policy state-action-reward-state-action (SARSA-style) learning to develop an offline reinforcement learning (RL) method termed robust offline Actor-Critic with on-policy regularized policy evaluation (OPRAC). With the help of SARSA-style bootstrap actions, a conservative on-policy Q-function and a penalty term for matching the on-policy and off-policy actions are jointly constructed to regularize the optimal Q-function of the off-policy QL-style. This naturally equips the off-policy QL-style policy evaluation with the intrinsic pessimistic conservatism of the on-policy SARSA-style, thus facilitating the acquisition of a stable estimated Q-function. Even with limited data sampling errors, the convergence of the Q-function learned by OPRAC and the controllability of the upper bound on the bias between the learned Q-function and its true Q-value can be theoretically guaranteed. In addition, the sub-optimality of the learned optimal policy stems merely from sampling errors. Experiments on the well-known D4RL Gym-MuJoCo benchmark demonstrate that, owing to the stable estimate of the Q-value, OPRAC can rapidly learn robust and effective task-solving policies, outperforming state-of-the-art offline RL methods by at least 15%.
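The on-policy regularized policy evaluation described in the abstract can be illustrated with a minimal sketch. The blending weight `eta`, the squared-error action-matching penalty, and all function and argument names below are illustrative assumptions inferred from the abstract, not the paper's exact formulation.

```python
import torch

def oprac_regularized_target(q_target_net, reward, next_state,
                             next_action_policy, next_action_dataset,
                             done, gamma=0.99, eta=0.5):
    """Sketch of a critic target that blends an off-policy QL-style bootstrap
    (action proposed by the current policy) with an on-policy SARSA-style
    bootstrap (action actually taken in the dataset), plus a penalty on their
    mismatch. The blend weight `eta` and the penalty form are assumptions."""
    with torch.no_grad():
        # Off-policy QL-style value: bootstrap with the policy's next action.
        q_ql = q_target_net(next_state, next_action_policy)
        # On-policy SARSA-style value: bootstrap with the dataset's next action.
        q_sarsa = q_target_net(next_state, next_action_dataset)
        # Penalty for the gap between on-policy and off-policy actions.
        penalty = (next_action_policy - next_action_dataset).pow(2).mean(-1, keepdim=True)
        # Conservative, regularized next-state value: the SARSA-style term and
        # the action-matching penalty pull the QL-style estimate toward the data.
        q_next = (1.0 - eta) * q_ql + eta * (q_sarsa - penalty)
        return reward + gamma * (1.0 - done) * q_next
```

In this reading, setting `eta` to zero recovers the plain QL-style bootstrap, while larger values impose more of the SARSA-style conservatism on policy evaluation.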

Keywords: Offline reinforcement learning; off-policy QL-style; on-policy SARSA-style; policy evaluation (PE); Q-value estimation

Classification Code: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]

 
