Advantage estimator based on importance sampling


Authors: LIU Quan [1,2,3,4], JIANG Yubin, HU Zhihui

Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou 215006, China; [2] Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou 215006, China; [3] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; [4] Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210093, China

Source: Journal on Communications (《通信学报》), 2019, No. 5, pp. 108-116 (9 pages)

Funding: National Natural Science Foundation of China (No.61772355, No.61702055, No.61472262, No.61502323, No.61502329); Natural Science Research Major Program of Jiangsu Higher Education Institutions (No.18KJA520011, No.17KJA520004); Fund of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (No.93K172014K04, No.93K172017K18); Suzhou Applied Basic Research Program, Industrial Part (No.SYG201422)

Abstract: In continuous action tasks, deep reinforcement learning usually adopts a Gaussian distribution as the policy function. To address the slow convergence caused by clipping actions sampled from a Gaussian policy, an importance sampling advantage estimator (ISAE) was proposed. Building on the generalized advantage estimator (GAE), ISAE introduces an importance sampling mechanism: by computing the ratio of the target policy to the behavior policy for boundary actions, it corrects the value-function bias caused by action clipping and thereby accelerates convergence. In addition, ISAE introduces an L parameter that limits the range of the importance sampling ratio, improving the reliability of samples and keeping the network parameters stable. To verify the effectiveness of ISAE, it was combined with proximal policy optimization (PPO) and compared with other algorithms on the MuJoCo platform. Experimental results show that ISAE achieves a faster convergence rate.
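The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the idea as stated there: standard GAE, except that the TD error of time steps whose raw Gaussian action fell outside the valid range (and was therefore clipped) is reweighted by the target-to-behavior policy ratio, clamped to [1/L, L]. All function and parameter names here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_log_prob(x, mean, std):
    # Log density of a diagonal Gaussian, summed over action dimensions.
    return (-0.5 * (((x - mean) / std) ** 2
                    + np.log(2 * np.pi * std ** 2))).sum(-1)

def isae(rewards, values, raw_actions, mu_target, mu_behavior, std,
         low=-1.0, high=1.0, gamma=0.99, lam=0.95, L=2.0):
    """Sketch of an importance sampling advantage estimator (ISAE)."""
    T = len(rewards)
    # One-step TD errors, with V = 0 after the final step.
    deltas = np.empty(T)
    for t in range(T):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        deltas[t] = rewards[t] + gamma * next_v - values[t]
    # Target/behavior policy ratio, clamped to [1/L, L] (the L parameter).
    rho = np.exp(gaussian_log_prob(raw_actions, mu_target, std)
                 - gaussian_log_prob(raw_actions, mu_behavior, std))
    rho = np.clip(rho, 1.0 / L, L)
    # Reweight only the boundary (clipped) actions.
    clipped = np.any((raw_actions <= low) | (raw_actions >= high), axis=-1)
    deltas = np.where(clipped, rho * deltas, deltas)
    # Standard backward GAE accumulation.
    adv = np.empty(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

When the target and behavior policies coincide, the ratio is 1 and the estimator reduces to plain GAE, which matches the abstract's description of ISAE as a correction on top of GAE rather than a replacement for it.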

Keywords: reinforcement learning; importance sampling; deep reinforcement learning; advantage function

CLC number: TP391 (Automation and Computer Technology: Computer Application Technology)

 
