Authors: XU Ping'an (徐平安); LIU Quan (刘全) [1,2,3,4]
Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; [2] Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China; [3] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; [4] Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
Source: Computer Science (《计算机科学》), 2023, No. 1, pp. 253-261 (9 pages)
Funding: National Natural Science Foundation of China (61772355, 61702055); Major Projects of Natural Science Research in Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Applied Basic Research Program, Industrial Part (SYG201422); Priority Academic Program Development of Jiangsu Higher Education Institutions.
Abstract: Policy distillation, a method of transferring knowledge from one policy to another, has achieved great success in challenging reinforcement learning tasks. The typical policy distillation approach uses a teacher-student policy model, in which knowledge is transferred from a teacher policy, which holds excellent empirical data, to a student policy. Obtaining a teacher policy is computationally expensive, so the dual policy distillation (DPD) framework was proposed; it no longer depends on a teacher policy but instead maintains two student policies that transfer knowledge to each other. However, if one student policy cannot surpass the other through self-learning, or if the two student policies converge after distillation, a deep reinforcement learning algorithm combined with DPD degenerates into a single-policy gradient optimization method. To address these problems, this paper defines a notion of similarity between student policies and proposes the similarity constrained dual policy distillation (SCDPD) framework. SCDPD dynamically adjusts the similarity between the two student policies during knowledge transfer, and it is theoretically shown to effectively enhance the exploration of the student policies as well as the stability of the algorithm. Experimental results show that the SCDPD-SAC and SCDPD-PPO algorithms, which combine SCDPD with classical off-policy and on-policy deep reinforcement learning algorithms respectively, outperform the classical algorithms on multiple continuous control tasks.
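The abstract gives no implementation details, so the following is a minimal, hypothetical sketch of what a similarity-constrained distillation term between two student policies might look like, assuming Gaussian policies, a KL-divergence similarity measure, and PyTorch. The function names, the advantage-based weighting, and the similarity band (low, high) are illustrative assumptions, not details taken from the paper.

# Illustrative sketch only; not the authors' implementation of SCDPD.
import torch
from torch.distributions import Normal, kl_divergence

def distill_loss(mu_a, std_a, mu_b, std_b, adv_b, beta):
    # Distill student B into student A on states where B looks better
    # (positive advantage estimate), weighted by the coefficient beta.
    pi_a = Normal(mu_a, std_a)
    pi_b = Normal(mu_b, std_b)
    kl = kl_divergence(pi_b, pi_a).sum(-1)   # per-state divergence from B to A
    weight = (adv_b > 0).float()             # learn only from "better" states
    return beta * (weight * kl).mean()

def adjust_beta(beta, mu_a, std_a, mu_b, std_b, low=0.1, high=1.0, step=0.05):
    # Keep the two students inside a similarity band: if they are nearly
    # identical (small KL), relax the distillation pressure so exploration
    # is preserved; if they drift too far apart, increase it.
    with torch.no_grad():
        sim = kl_divergence(Normal(mu_a, std_a), Normal(mu_b, std_b)).sum(-1).mean()
    if sim.item() < low:
        beta = max(beta - step, 0.0)
    elif sim.item() > high:
        beta = beta + step
    return beta

In this sketch the distillation weight beta plays the role of the similarity constraint: it is tightened or relaxed depending on how close the two student policies currently are, which is the qualitative behavior the abstract attributes to SCDPD.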
Keywords: deep reinforcement learning; policy distillation; similarity constraint; knowledge transfer; continuous control tasks
Classification: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]