Authors: XU Ping'an (徐平安); LIU Quan (刘全) [1,2,3,4]
Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; [2] Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China; [3] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; [4] Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
Source: Computer Science (《计算机科学》), 2023, No. 1, pp. 253-261 (9 pages)
Funding: National Natural Science Foundation of China (61772355, 61702055); Major Projects of Natural Science Research in Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Applied Basic Research Program, Industrial Part (SYG201422); Priority Academic Program Development of Jiangsu Higher Education Institutions.
Abstract: Policy distillation, a method of transferring knowledge from one policy to another, has achieved great success in challenging reinforcement learning tasks. The typical policy distillation approach uses a teacher-student policy model, in which knowledge is transferred from a teacher policy, which holds excellent empirical data, to a student policy. Obtaining a teacher policy is computationally expensive, so the dual policy distillation (DPD) framework was proposed; it no longer depends on a teacher policy but instead maintains two student policies that transfer knowledge to each other. However, if one student policy cannot surpass the other through self-learning, or if the two student policies converge after distillation, a deep reinforcement learning algorithm combined with DPD degenerates into a single-policy gradient optimization method. To address these problems, this paper defines a notion of similarity between student policies and proposes the similarity constrained dual policy distillation (SCDPD) framework. SCDPD dynamically adjusts the similarity between the two student policies during knowledge transfer, and it is theoretically shown to effectively enhance the exploration of the student policies as well as the stability of the algorithm. Experimental results show that the SCDPD-SAC and SCDPD-PPO algorithms, which combine SCDPD with classical off-policy and on-policy deep reinforcement learning algorithms respectively, outperform the classical algorithms on multiple continuous control tasks.
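The abstract gives no implementation details, so the following is a minimal, hypothetical sketch of what a similarity-constrained distillation term between two student policies might look like, assuming Gaussian policies, a KL-divergence similarity measure, and PyTorch. The function names, the advantage-based weighting, and the similarity band (low, high) are illustrative assumptions, not details taken from the paper.

# Illustrative sketch only; not the authors' implementation of SCDPD.
import torch
from torch.distributions import Normal, kl_divergence

def distill_loss(mu_a, std_a, mu_b, std_b, adv_b, beta):
    # Distill student B into student A on states where B looks better
    # (positive advantage estimate), weighted by the coefficient beta.
    pi_a = Normal(mu_a, std_a)
    pi_b = Normal(mu_b, std_b)
    kl = kl_divergence(pi_b, pi_a).sum(-1)   # per-state divergence from B to A
    weight = (adv_b > 0).float()             # learn only from "better" states
    return beta * (weight * kl).mean()

def adjust_beta(beta, mu_a, std_a, mu_b, std_b, low=0.1, high=1.0, step=0.05):
    # Keep the two students inside a similarity band: if they are nearly
    # identical (small KL), relax the distillation pressure so exploration
    # is preserved; if they drift too far apart, increase it.
    with torch.no_grad():
        sim = kl_divergence(Normal(mu_a, std_a), Normal(mu_b, std_b)).sum(-1).mean()
    if sim.item() < low:
        beta = max(beta - step, 0.0)
    elif sim.item() > high:
        beta = beta + step
    return beta

In this sketch the distillation weight beta plays the role of the similarity constraint: it is tightened or relaxed depending on how close the two student policies currently are, which is the qualitative behavior the abstract attributes to SCDPD.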
Keywords: deep reinforcement learning; policy distillation; similarity constraint; knowledge transfer; continuous control tasks
Classification: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]