OSCAR: OOD State-Conservative Offline Reinforcement Learning for Sequential Decision Making


Authors: Yi Ma, Chao Wang, Chen Chen, Jinyi Liu, Zhaopeng Meng, Yan Zheng, Jianye Hao

Affiliations: [1] College of Intelligence and Computing, Tianjin University, Tianjin 300350, China; [2] Lab for High Technology, Tsinghua University, Beijing 100084, China; [3] Department of Automation, Tsinghua University, Beijing 100084, China; [4] Noah's Ark Lab, Huawei Technologies Co., Ltd., Beijing 100084, China

Source: CAAI Artificial Intelligence Research, 2023, No. 1, pp. 91-101 (11 pages)

Funding: Supported by the National Key R&D Program of China (No. 2022ZD0116402) and the National Natural Science Foundation of China (No. 62106172).

Abstract: Offline reinforcement learning (RL) is a data-driven learning paradigm for sequential decision making. At the core of offline RL lies mitigating the overestimation of values at out-of-distribution (OOD) states, which arises from the distribution shift between the learning policy and the previously collected offline dataset. To tackle this problem, some methods underestimate the values of states produced by learned dynamics models, or of state-action pairs whose actions are sampled from policies other than the behavior policy. However, since these generated states or state-action pairs are not guaranteed to be OOD, staying conservative on them may adversely affect the in-distribution ones. In this paper, we propose an OOD state-conservative offline RL method (OSCAR), which addresses this limitation by explicitly generating reliable OOD states located near the manifold of the offline dataset, and then designs a conservative policy evaluation approach that combines the vanilla Bellman error with a regularization term that underestimates the values of only these generated OOD states. In this way, we prevent the value errors of OOD states from propagating to in-distribution states through value bootstrapping and policy improvement. We also theoretically prove that the proposed conservative policy evaluation approach is guaranteed to underestimate the values of OOD states. OSCAR, along with several strong baselines, is evaluated on the offline decision-making benchmark D4RL and the autonomous driving benchmark SMARTS. Experimental results show that OSCAR outperforms the baselines on a large portion of the benchmarks and attains the highest average return, substantially outperforming existing offline RL methods.
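To make the conservative policy evaluation described in the abstract concrete, the following is a minimal sketch of a critic loss that combines the vanilla Bellman error with a regularizer pushing down Q-values only at generated OOD states. It is an illustration under stated assumptions, not the paper's released implementation: the Gaussian-perturbation OOD state generator stands in for the paper's generator, and the names (conservative_critic_loss, beta, ood_noise_std) are hypothetical.

import torch
import torch.nn.functional as F

def conservative_critic_loss(q_net, q_target, policy, batch,
                             beta=1.0, gamma=0.99, ood_noise_std=0.1):
    """Bellman error on dataset transitions + penalty on Q at generated OOD states.

    q_net, q_target, policy: callables mapping (state[, action]) tensors to tensors.
    batch: (s, a, r, s_next, done) tensors sampled from the offline dataset.
    """
    s, a, r, s_next, done = batch

    # Standard Bellman target computed on in-distribution transitions.
    with torch.no_grad():
        a_next = policy(s_next)
        target = r + gamma * (1.0 - done) * q_target(s_next, a_next)
    bellman_loss = F.mse_loss(q_net(s, a), target)

    # Generate OOD states near the dataset manifold. Here we simply perturb
    # dataset states with Gaussian noise; OSCAR instead uses an explicit
    # generator trained to produce reliable OOD states.
    s_ood = s + ood_noise_std * torch.randn_like(s)
    a_ood = policy(s_ood)

    # Regularization term that underestimates values only at the generated
    # OOD states, leaving in-distribution values governed by the Bellman error.
    ood_penalty = q_net(s_ood, a_ood).mean()

    return bellman_loss + beta * ood_penalty

The key design point reflected here is that the penalty is applied exclusively to the generated OOD states, so conservatism does not bleed into in-distribution state-action pairs the way it can when penalizing model rollouts or off-policy actions indiscriminately.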

Keywords: offline reinforcement learning; out-of-distribution; decision making

Classification: TP18 [Automation and Computer Technology - Control Theory and Control Engineering]
