Authors: ZHANG Yuxin (张钰欣); ZHAO Enjiao (赵恩娇); ZHAO Yuxin (赵玉新) (College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China)
Affiliation: [1] College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, Heilongjiang, China
Source: CAAI Transactions on Intelligent Systems (《智能系统学报》), 2024, Issue 1, pp. 190-208 (19 pages)
Funding: National Natural Science Foundation of China (61903099); Natural Science Foundation of Heilongjiang Province (LH2020F025); Science and Technology Research Program of the Chongqing Municipal Education Commission (KJZD-K20200470); China Postdoctoral Science Foundation General Program (2021M690812); Heilongjiang Provincial Postdoctoral Foundation General Program (LBH-Z21048).
Abstract: To address the dynamic attrition of UAV numbers during multi-UAV game confrontation, together with the sparse-reward problem and the excessively frequent sampling of ineffective experience in conventional deep reinforcement learning, this paper studies multi-UAV game confrontation under limited attack/defense capability and communication range. A confrontation model of red and blue UAV swarms is constructed, and the original multi-agent deep deterministic policy gradient (MADDPG) algorithm is improved within its Actor-Critic framework according to the characteristics of the game environment. To further improve the exploration and exploitation of effective experience, a rule-coupling module is built to assist the Actor network during UAV decision-making. Simulation results show that the proposed algorithm improves convergence speed, learning efficiency, and stability; the introduction of heterogeneous sub-networks makes the algorithm better suited to game scenarios in which the number of UAVs decays dynamically; the prioritized experience replay method that couples a reward potential function with importance weights refines the differentiation between experiences and raises the utilization rate of advantageous experience; and the rule-coupling module enables the UAV decision network to make effective use of prior knowledge.
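The replay mechanism described in the abstract, which couples a reward potential function with importance weights in prioritized experience replay, can be pictured with a short sketch. The Python below is a minimal illustration under assumptions of our own, not the authors' implementation: the class name PrioritizedReplay, the exact coupling formula, and the parameters alpha, beta, and gamma are all illustrative. Priorities mix the usual |TD error| with a potential-difference term, and importance-sampling weights correct the sampling bias in the critic update.

import numpy as np

class PrioritizedReplay:
    """Sketch of prioritized replay whose priorities are coupled with a
    potential-based shaping term (assumed form, for illustration only)."""

    def __init__(self, capacity, alpha=0.6, beta=0.4):
        self.capacity = capacity
        self.alpha = alpha            # how strongly priorities skew sampling
        self.beta = beta              # importance-sampling correction strength
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error, phi_s, phi_s_next, gamma=0.99):
        # Couple the |TD error| priority with the potential difference
        # gamma * phi(s') - phi(s), so transitions that climb the reward
        # potential are replayed more often (assumed coupling rule).
        shaping = gamma * phi_s_next - phi_s
        priority = (abs(td_error) + abs(shaping) + 1e-6) ** self.alpha
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        p = np.asarray(self.priorities)
        probs = p / p.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance weights undo the bias introduced by non-uniform sampling;
        # they would scale each sample's loss in the critic update.
        weights = (len(self.buffer) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights

In a MADDPG-style training loop, each agent's critic loss would be weighted by the returned importance weights, and priorities would be refreshed with the new TD errors after each update; those integration details are likewise assumptions here rather than the paper's procedure.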
Keywords: deep reinforcement learning; multi-UAV; game confrontation; MADDPG; Actor-Critic; rule coupling; experience replay; sparse reward
Classification: V279 [Aeronautical and Astronautical Science and Technology / Aircraft Design]