Authors: WU Shao-bo (吴少波); FU Qi-ming (傅启明) [1,2,3]; CHEN Jian-ping (陈建平) [2,3]; WU Hong-jie (吴宏杰) [1,2]; LU You (陆悠)
Affiliations: [1] School of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China; [2] Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China; [3] Suzhou Key Laboratory of Mobile Network Technology and Application, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
Source: Computer Science (《计算机科学》), 2021, Issue 9, pp. 257-263 (7 pages)
Funding: National Natural Science Foundation of China (61876217, 61876121, 61772357, 61750110519, 61772355, 61702055, 61672371); Jiangsu Province Key Research and Development Program (BE2017663).
Abstract: Traditional inverse reinforcement learning algorithms solve the reward function slowly and imprecisely, or fail to solve it at all, when expert demonstration samples are insufficient and the state transition probabilities are unknown. To address this problem, a meta inverse reinforcement learning method based on relative entropy is proposed. Using meta-learning, a learning prior for the target task is constructed from a set of meta-training tasks drawn from the same distribution as the target task. In the model-free setting, the reward function is modeled with a relative entropy probability model and combined with the constructed prior, so that the reward function of the target task can be solved quickly from only a small number of target-task samples. The proposed algorithm and the REIRL algorithm are applied to the classic Gridworld and Object World problems. Experiments show that the proposed algorithm still solves the reward function well when the target task lacks a sufficient number of expert demonstration samples and state transition probability information.
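For readers unfamiliar with the relative entropy IRL machinery the abstract refers to, the following is a minimal sketch of the standard model-free weight update (importance-sampled feature matching with a linear reward), not the paper's algorithm: the function and variable names (reirl_reward_weights, demo_features, sample_features, theta0) are illustrative assumptions, the regularization terms of the original REIRL formulation are omitted, and the paper's meta-learned prior is only hinted at through the choice of the initial weights theta0.

# Sketch of relative entropy IRL in the model-free setting: reward weights theta
# are adjusted so that the reweighted baseline trajectories match the expert's
# feature expectations. Illustrative code, not the authors' implementation.
import numpy as np

def reirl_reward_weights(demo_features, sample_features, theta0, lr=0.1, iters=200):
    """Estimate linear reward weights theta for p(tau) proportional to
    q(tau) * exp(theta . f(tau)).

    demo_features   : (N_demo, d) feature counts of expert demonstrations
    sample_features : (N_samp, d) feature counts of trajectories from a baseline
                      policy q, used for importance sampling
    theta0          : (d,) initial weights; a meta-learned prior could be supplied here
    """
    theta = theta0.astype(float)
    f_expert = demo_features.mean(axis=0)      # empirical expert feature expectation
    for _ in range(iters):
        # Importance weights w(tau) proportional to exp(theta . f(tau))
        logits = sample_features @ theta
        logits -= logits.max()                 # numerical stability
        w = np.exp(logits)
        w /= w.sum()
        f_model = w @ sample_features          # estimated model feature expectation
        theta += lr * (f_expert - f_model)     # dual gradient: feature matching
    return theta

# Toy usage on random 2-D features (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
demo = rng.normal(1.0, 0.3, size=(20, 2))
samp = rng.normal(0.0, 1.0, size=(500, 2))
print("estimated reward weights:", reirl_reward_weights(demo, samp, theta0=np.zeros(2)))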
Classification Code: TP311 (Automation and Computer Technology - Computer Software and Theory)