Affiliations: [1] School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083; [2] Shunde Graduate School, University of Science and Technology Beijing, Foshan, Guangdong 528399
Source: Chinese Journal of Computers (《计算机学报》), 2022, No. 11, pp. 2306-2320, 15 pages
Funding: National Natural Science Foundation of China (62101029); Postdoctoral Innovative Talents Support Program (BX20190033); Joint Fund of the Guangdong Basic and Applied Basic Research Foundation (2019A1515110325); General Program of the China Postdoctoral Science Foundation (2020M670135); Postdoctoral Research Fund of Shunde Graduate School, University of Science and Technology Beijing (2020BH001); Fundamental Research Funds for the Central Universities (06500127)
Abstract: In recent years, reinforcement learning has achieved remarkable successes in many fields such as intelligent traffic control, competitive gaming, unmanned system positioning, and navigation. As more and more realistic scenarios require multiple agents to undertake complex tasks cooperatively, researchers pay more attention to multi-agent settings than to single agents in reinforcement learning. At the same time, in multi-agent reinforcement learning (MARL), learning to cooperate has become a research hotspot, meaning agents need to learn to cooperate using only their actions and local observations. However, the credit assignment problem must be solved when studying the cooperative behaviour of a multi-agent system with deep reinforcement learning (DRL). In the process of learning to complete tasks, the partially observable environment provides reward reinforcement signals for the joint actions produced by the agents, and these signals are used to update the parameters of the deep reinforcement learning network. But the global reward is non-Markovian: when an agent takes an action in the current state, the actual reward signal for this action is usually given only after several time steps, and in difficult multi-agent environments this phenomenon is even more severe. In addition, because all agents share the same global reward, it is hard to determine how much each agent in the system contributes to the whole. When one agent learns a good strategy early and gains a high return, the other agents may stop exploring, which traps the whole system in a local optimum. To solve these problems, this paper introduces a credit assignment algorithm based on reward filtering that is not restricted by the action space. The goal is to recover the local reward of each agent from the global reward obtained by all agents and to use it in training the action-value network. The exploration behaviour of other agents is the main cause of the non-stationarity of the environment, so an agent's own reward signal can be obtained by removing this non-stationary influence from the global reward. Based on this, starting from the global reward signal collected during centralized training, the influence of the non-stationary environment caused by other agents is modeled as noise, and the local reward of each agent is recovered by filtering; these local rewards are then used to coordinate the behaviour of the agents and better maximize the overall reward. We also propose a reward-filtering-based multi-agent deep reinforcement learning (RF-MADRL) framework and validate it in the cooperative navigation environment provided by OpenAI. The experimental results show that, compared with the baseline algorithms, the reward-filtering-based deep reinforcement learning method performs better: the policy of the multi-agent system converges faster and the obtained reward is higher.
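The abstract describes the filtering idea only at a high level (treat the non-stationary influence of other agents as noise in the shared global reward and filter it out to recover per-agent local rewards); the filter itself is not reproduced in this record. The following is a minimal illustrative sketch in Python of that general idea, assuming a simple per-agent scalar Kalman-style smoother and an even split of the global reward as the observation model. The class name RewardFilter, the even-split observation, and the variance parameters are assumptions made here for illustration; they are not the RF-MADRL design from the paper.

```python
import numpy as np


class RewardFilter:
    """Illustrative per-agent reward filter (hypothetical sketch, not the paper's exact algorithm).

    Treats the shared global reward as a noisy observation of each agent's
    local reward, where the "noise" stands in for the non-stationary influence
    of the other agents, and smooths it with a scalar Kalman-style update.
    """

    def __init__(self, n_agents, process_var=1e-2, obs_var=1.0):
        self.n_agents = n_agents
        self.estimate = np.zeros(n_agents)   # filtered local reward per agent
        self.variance = np.ones(n_agents)    # uncertainty of each estimate
        self.process_var = process_var       # how fast local rewards may drift
        self.obs_var = obs_var                # noise attributed to other agents

    def step(self, global_reward):
        """Update each agent's local-reward estimate from the shared global reward."""
        # Predict: allow the local reward to drift between time steps.
        self.variance += self.process_var
        # Update: use an even split of the global reward as a crude observation
        # of each agent's contribution (an assumption of this sketch).
        observation = np.full(self.n_agents, global_reward / self.n_agents)
        gain = self.variance / (self.variance + self.obs_var)
        self.estimate += gain * (observation - self.estimate)
        self.variance *= (1.0 - gain)
        return self.estimate.copy()


if __name__ == "__main__":
    rf = RewardFilter(n_agents=3)
    for t in range(5):
        # In centralized training, global_reward would come from the environment.
        local_rewards = rf.step(global_reward=1.0 + 0.1 * t)
        print(t, local_rewards)
```

In a centralized-training setup of the kind the abstract describes, the filtered per-agent rewards would replace the shared global reward when updating each agent's action-value network.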
Keywords: multi-agent system; deep reinforcement learning; credit assignment; reward filtering; cooperative navigation
Classification: TP391 [Automation and Computer Technology - Computer Application Technology]