Authors: WANG Kun; ZHAO Yingce; WANG Guangyao[2]; LI Jianxun[1]
Affiliations: [1] Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China; [2] Shenyang Aircraft Design and Research Institute, Shenyang 110035, China
Source: Command Control & Simulation, 2024, No. 5, pp. 77-84 (8 pages)
Funding: National Natural Science Foundation of China (61673265); National Key R&D Program of China (2020YFC1512203); Shanghai Commercial Aircraft System Engineering Joint Research Fund (CASEF-2022-MQ01)
Abstract: Improving the training effect of multi-agent systems has long been a focus of reinforcement learning. Building on the multi-agent twin delayed deep deterministic policy gradient (MATD3) algorithm, a parameter-sharing mechanism is introduced to improve training efficiency. To alleviate the inconsistency between the real reward and the auxiliary reward, a decay factor for the auxiliary reward is proposed, drawing on the idea of curriculum learning, so as to preserve active policy exploration in the early stage of training and reward consistency in the late stage. The improved MATD3 algorithm is applied to combat-vehicle game confrontation to realize intelligent decision-making for the vehicles; the results show that the reward curve of the intelligent vehicle converges stably with good performance. A comparative simulation between the improved and the original MATD3 algorithm further verifies that the improvement effectively increases the convergence speed and the converged reward value.
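The core modification described in the abstract is weighting the auxiliary (shaping) reward by a factor that decays over training, so exploration is encouraged early while the real reward dominates late. Below is a minimal Python sketch of that idea under assumed details: the linear schedule and the names (shaped_reward and its arguments) are illustrative assumptions, not the paper's implementation.

```python
def shaped_reward(real_reward: float, aux_reward: float,
                  episode: int, total_episodes: int) -> float:
    """Combine real and auxiliary rewards with a decaying weight.

    Sketch only: a linear decay from 1 to 0 over training is assumed;
    the paper's exact decay schedule is not given in this record.
    """
    decay = max(0.0, 1.0 - episode / total_episodes)  # assumed linear decay factor
    return real_reward + decay * aux_reward

# Early in training the auxiliary reward contributes almost fully,
# late in training its influence is nearly gone:
print(shaped_reward(1.0, 0.5, episode=10, total_episodes=1000))   # ~1.495
print(shaped_reward(1.0, 0.5, episode=990, total_episodes=1000))  # ~1.005
```

Any monotonically decreasing schedule (e.g., exponential) would serve the same purpose; the key property is that the combined reward converges to the real reward as training ends.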
CLC Number: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]