Advantage Weighted Double Actors-Critics Algorithm Based on Key-Minor Architecture for Policy Distillation

Authors: YANG Haolin; LIU Quan[1,2]

Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; [2] Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China

Source: Computer Science, 2024, Issue 11, pp. 81-94 (14 pages)

Funding: National Natural Science Foundation of China (62376179, 61772355, 61702055, 61876217, 62176175); Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01A238); Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).

Abstract: Offline reinforcement learning (offline RL) defines the task of learning from a fixed batch of data, which avoids the risks of interacting with the environment and improves the efficiency and stability of learning. Among such methods, the advantage weighted actor-critic algorithm combines sample-efficient dynamic programming with maximum-likelihood policy updates, exploiting large amounts of offline data while quickly performing fine-grained online policy adjustment. However, that algorithm relies on uniform random experience replay, and its actor-critic model uses only a single actor, leaving data sampling and replay unbalanced. To address these problems, this paper proposes an advantage weighted double actors-critics algorithm based on policy distillation with data experience optimization and replay (DOR-PDAWAC). DOR-PDAWAC prefers new experiences while repeatedly replaying both new and old ones, uses double actors to increase exploration, and adopts a key-minor architecture for policy distillation that divides the actors into a key actor and a minor actor to improve cooperation efficiency. Ablation and comparison experiments on the MuJoCo tasks of the general D4RL benchmark show that the proposed algorithm achieves better performance in learning efficiency and other respects.
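To make the mechanisms named in the abstract concrete, the sketch below illustrates three of them in Python with PyTorch: a replay buffer biased toward newer transitions, the advantage-weighted maximum-likelihood actor update used by AWAC-family methods, and a KL-based key-minor distillation term. This is a minimal sketch under stated assumptions, not the paper's implementation; the names RecencyBiasedBuffer, recency_bias, and lam, and the Gaussian policies in the usage example, are all illustrative.

```python
import numpy as np
import torch

# Illustrative sketch only -- names, hyperparameters, and interfaces below
# are assumptions for exposition, not the paper's released code.

class RecencyBiasedBuffer:
    """Replay buffer that prefers new experiences but still replays old
    ones, in the spirit of the sampling mechanism the abstract describes."""

    def __init__(self, capacity, recency_bias=2.0):
        self.capacity = capacity
        self.recency_bias = recency_bias  # > 0 biases sampling toward newer data
        self.storage = []

    def add(self, transition):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)  # evict the oldest transition
        self.storage.append(transition)

    def sample(self, batch_size):
        n = len(self.storage)
        # Weight transition i by (i + 1)^recency_bias: newer entries get
        # higher sampling probability while older ones keep nonzero mass.
        weights = np.arange(1, n + 1, dtype=np.float64) ** self.recency_bias
        probs = weights / weights.sum()
        idx = np.random.choice(n, size=batch_size, p=probs)
        return [self.storage[i] for i in idx]


def advantage_weighted_loss(log_probs, advantages, lam=1.0):
    """AWAC-style actor loss: maximum likelihood on buffer actions,
    reweighted by exp(A(s, a) / lam) so high-advantage actions dominate."""
    weights = torch.exp(advantages / lam).clamp(max=100.0)  # clamp for stability
    return -(weights * log_probs).mean()


def distillation_loss(key_dist, minor_dist):
    """Key-minor distillation sketch: pull the minor actor's action
    distribution toward the key actor's with a KL penalty."""
    return torch.distributions.kl_divergence(key_dist, minor_dist).mean()


# Toy usage with diagonal Gaussian action distributions (illustrative only).
buf = RecencyBiasedBuffer(capacity=1000)
for t in range(100):
    buf.add({"step": t})
batch = buf.sample(32)  # skewed toward recent steps

key = torch.distributions.Normal(torch.zeros(4), torch.ones(4))
minor = torch.distributions.Normal(0.1 * torch.ones(4), torch.ones(4))
print(distillation_loss(key, minor))
```

The power-law recency weighting is one simple way to realize "prefer new experience but replay old and new repeatedly": unlike sampling only the newest data, every stored transition keeps a nonzero probability of being revisited.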

Keywords: offline reinforcement learning; deep reinforcement learning; policy distillation; double actors-critics framework; experience replay mechanism

CLC number: TP181 (Automation and Computer Technology: Control Theory and Control Engineering)

 
