Model-based Multi-agent Mean-field Intrinsic Reward Upper Confidence Reinforcement Learning Algorithm

Authors: SUN Wenqi, LI Dapeng[1], TIAN Feng[1], DING Lianghui[2] (School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China; School of Electronic Information and Electrical Engineering, Shanghai Jiaotong University, Shanghai 200240, China)

Affiliations: [1] School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China; [2] Department of Electronic Engineering, Shanghai Jiaotong University, Shanghai 200240, China

Source: Radio Communications Technology (《无线电通信技术》), 2023, No. 3, pp. 556-565 (10 pages)

Funding: National Key R&D Program of China (2021ZD0140405).

Abstract: In complex multi-agent scenarios, a simple reward function designed only from the final goal cannot effectively guide an agent's policy learning. To address this problem, a Model-based Multi-agent Mean-field Intrinsic Reward Upper Confidence Reinforcement Learning (M3IR-UCRL) algorithm is proposed. The algorithm adds an intrinsic reward module to the reward function; the generated intrinsic reward, together with the extrinsic reward that defines the task, helps a representative agent learn its policy in a multi-agent system simplified by Mean-Field Control (MFC). During learning, the agent first updates the policy parameters along the gradient of the expected cumulative weighted sum of intrinsic and extrinsic rewards, and then updates the intrinsic-reward parameters along the gradient of the expected cumulative extrinsic reward. Simulation results show that, compared with the Model-based Multi-agent Mean-field Upper-Confidence RL (M3-UCRL) algorithm, which guides agent learning with the simple extrinsic reward alone, the proposed algorithm effectively improves the agents' task completion rate in complex multi-agent scenarios and reduces the collision rate with the surrounding environment, thereby improving overall performance.
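The procedure described in the abstract is a bi-level, meta-gradient update: the policy parameters θ ascend the mixed objective J_mix(θ; η) = E[Σ_t γ^t (r_ex,t + λ r_in,t)], and the intrinsic-reward parameters η then ascend the purely extrinsic objective evaluated at the updated policy, ∇_η J_ex(θ′(η)), so the learned intrinsic reward is shaped only by how much it improves extrinsic performance. The paper applies this inside a model-based, mean-field-control learner; the sketch below is only a minimal single-agent, model-free illustration of the two-step update (in the spirit of LIRPG-style intrinsic-reward learning). The chain MDP, the tabular softmax policy, the tabular intrinsic reward, and all hyper-parameters are illustrative assumptions, not the authors' implementation.

```python
"""Minimal sketch of the two-step update described in the abstract:
(1) update the policy on the weighted sum of extrinsic and intrinsic
returns; (2) update the intrinsic-reward parameters along the gradient
of the extrinsic return, chained through the policy step. Illustrative
only; not the authors' M3IR-UCRL implementation."""
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 2          # assumed toy chain MDP: left / right
T, GAMMA = 10, 0.95                 # horizon and discount (assumptions)
ALPHA, BETA, LAM = 0.5, 0.1, 0.5    # policy lr, intrinsic lr, mixing weight

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout(theta):
    """Sample one episode; extrinsic reward 1 for reaching the last state."""
    s, traj = 0, []
    for _ in range(T):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
        traj.append((s, a, 1.0 if s_next == N_STATES - 1 else 0.0))
        s = s_next
    return traj

def returns(rs):
    """Discounted reward-to-go for a list of scalar rewards."""
    g, out = 0.0, []
    for r in reversed(rs):
        g = r + GAMMA * g
        out.append(g)
    return out[::-1]

def logp_grads(theta, traj):
    """Per-step score functions for a tabular softmax policy."""
    psis = []
    for s, a, _ in traj:
        psi = np.zeros_like(theta)
        psi[s] = -softmax(theta[s])
        psi[s, a] += 1.0
        psis.append(psi)
    return psis

theta = np.zeros((N_STATES, N_ACTIONS))   # policy parameters
eta = np.zeros((N_STATES, N_ACTIONS))     # intrinsic-reward parameters

for it in range(200):
    # step 1: policy ascends the mixed (extrinsic + weighted intrinsic) return
    traj = rollout(theta)
    psis = logp_grads(theta, traj)
    g_ex = returns([r for _, _, r in traj])
    g_in = returns([eta[s, a] for s, a, _ in traj])
    grad_mix = sum(psi * (ge + LAM * gi)
                   for psi, ge, gi in zip(psis, g_ex, g_in))
    theta_new = theta + ALPHA * grad_mix

    # step 2: meta-gradient of the extrinsic return w.r.t. eta, chained
    # through the policy step: d(theta')/d(eta), then dJ_ex/d(theta').
    # dG_in_t/d(eta[s,a]) = sum_{k>=t} gamma^(k-t) 1[(s_k,a_k)=(s,a)]
    dGin = np.zeros((T, N_STATES, N_ACTIONS))
    acc = np.zeros((N_STATES, N_ACTIONS))
    for t in reversed(range(T)):
        s, a, _ = traj[t]
        acc *= GAMMA
        acc[s, a] += 1.0
        dGin[t] = acc
    dtheta = ALPHA * LAM * sum(np.einsum('ij,kl->ijkl', psi, dGin[t])
                               for t, psi in enumerate(psis))
    # fresh rollout under the updated policy estimates dJ_ex/d(theta')
    traj2 = rollout(theta_new)
    grad_ex2 = sum(psi * g for psi, g in
                   zip(logp_grads(theta_new, traj2),
                       returns([r for _, _, r in traj2])))
    grad_eta = np.einsum('ijkl,ij->kl', dtheta, grad_ex2)

    theta, eta = theta_new, eta + BETA * grad_eta

print("P[right] per state:", np.round([softmax(r)[1] for r in theta], 2))
```

Note that η receives no gradient from the mixed objective directly: it is credited only through its influence on the policy step (the contraction of ∂θ′/∂η with ∇J_ex), which is what keeps the generated intrinsic reward aligned with the task-defining extrinsic reward.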

Keywords: multi-agent systems; mean-field control; model-based reinforcement learning; intrinsic reward

CLC Number: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
