基于自适应不确定性度量的离线强化学习算法

Adaptive uncertainty quantification for model-based offline reinforcement learning

Authors: ZHANG Bolei; LIU Zherun (School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China)

Affiliation: [1] School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China

Source: Journal of Nanjing University of Posts and Telecommunications: Natural Science Edition, 2024, No. 4, pp. 98-104 (7 pages)

Funding: Supported by the National Natural Science Foundation of China (62202238).

Abstract: Offline reinforcement learning (RL) learns an executable policy directly from historical experience data, avoiding costly interaction with the online environment. It can be applied to real-world scenarios such as robot control, autonomous driving, and intelligent marketing. Model-based offline RL first builds an environment model by supervised learning and then optimizes the policy by interacting with that model. This approach is sample-efficient and is widely used in offline RL. However, offline datasets suffer from distributional shift, which leads to out-of-distribution problems, and existing methods mostly measure the resulting uncertainty with static metrics that cannot adapt to the agent's policy optimization process. To address this problem, an adaptive uncertainty quantification method is proposed. The method first estimates the uncertainty of each state and then uses a dynamic, adaptive weight to measure the uncertainty of the environment model, so that the agent achieves a better balance between exploration and conservatism. The algorithm is evaluated on multiple offline benchmark datasets. The experimental results show that it achieves the best results on multiple datasets, and ablation studies further verify the effectiveness of the proposed method.

Keywords: offline reinforcement learning; environment model; adaptive weight; uncertainty quantification

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]
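
The abstract describes penalizing model uncertainty with a weight that adapts during policy optimization, but this record does not give the formulas. The following is only a minimal sketch of that general idea, not the paper's algorithm. It assumes ensemble disagreement as the per-state uncertainty estimate (a common choice in model-based offline RL) and a hypothetical running-mean rule for the weight; the names `ensemble_uncertainty`, `AdaptiveWeight`, and `penalized_reward` are illustrative and do not come from the source.

```python
import math
from statistics import fmean


def ensemble_uncertainty(predictions):
    """Disagreement of a model ensemble for one state-action pair.

    `predictions` holds one predicted next state (list of floats) per
    ensemble member. Returns the largest Euclidean distance between a
    member's prediction and the ensemble mean. This estimator is an
    assumption; the abstract does not specify one.
    """
    dim = len(predictions[0])
    mean = [fmean(p[i] for p in predictions) for i in range(dim)]
    return max(math.dist(p, mean) for p in predictions)


class AdaptiveWeight:
    """Hypothetical adaptive penalty weight (not the paper's rule).

    Keeps a running mean of the uncertainties seen so far and scales a
    base weight by how uncertain the current state is relative to it.
    """

    def __init__(self, base_weight=1.0, momentum=0.9, max_scale=5.0):
        self.base_weight = base_weight
        self.momentum = momentum
        self.max_scale = max_scale
        self.running_mean = None

    def __call__(self, uncertainty):
        if self.running_mean is None:
            self.running_mean = uncertainty
        else:
            self.running_mean = (self.momentum * self.running_mean
                                 + (1.0 - self.momentum) * uncertainty)
        # Small constant avoids division by zero when all models agree.
        scale = uncertainty / (self.running_mean + 1e-8)
        return self.base_weight * min(scale, self.max_scale)


def penalized_reward(reward, predictions, weight_fn):
    """Model reward minus a weighted uncertainty penalty."""
    u = ensemble_uncertainty(predictions)
    return reward - weight_fn(u) * u
```

In this sketch, states whose uncertainty is well above the running average receive a larger penalty weight (up to `max_scale`), while states where the ensemble agrees are left almost unpenalized, which is one simple way to trade exploration against conservatism.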
