Perception Error-resistant Air Combat Maneuvering Decisions Based on Deep Reinforcement Learning


Authors: TIAN Chengbin; LI Hui[1,2]; CHEN Xiliang; WU Fengguo (College of Computer Science, Sichuan University, Chengdu 610065, China; National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, China; College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China)

Affiliations: [1] College of Computer Science, Sichuan University, Chengdu 610065, Sichuan, China; [2] National Defense Key Discipline Laboratory of Visual Synthesis Graphics and Image Technology, Sichuan University, Chengdu 610065, Sichuan, China; [3] College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, Jiangsu, China

Source: Advanced Engineering Sciences, 2024, No. 6, pp. 270-282 (13 pages)

Funding: Key Program of the National Natural Science Foundation of China (U20A20161); National Natural Science Foundation of China (62273356).

Abstract: In within-visual-range (WVR) air combat maneuvering decision-making, onboard sensing equipment such as electro-optical sensors and radar is susceptible to enemy jamming and meteorological factors, producing situational perception errors. Although deep reinforcement learning (DRL) has made significant progress in air combat maneuvering decision-making, existing methods do not consider the influence of such perception errors on DRL. Because the state space is continuous and high-dimensional, perception errors degrade the precision and accuracy of state estimation, which in turn slows DRL training and degrades decision performance. To address this problem, a proximal policy optimization algorithm with gated recurrent unit (GRU) situation-feature extraction (GPPO) is proposed. First, a GRU is introduced on top of proximal policy optimization (PPO) to fuse preceding situation information and extract the hidden features shared across consecutive situation sequences. Second, an advantage-situation solving unit compresses the dimensionality of the DRL state space to reduce training difficulty, and a quantified-advantage reward shaping (RS) method is designed to guide DRL training toward faster convergence. Finally, a relative situation model of WVR air combat is defined and described, an air combat simulation environment with situational perception errors is built by designing and injecting perception error terms, and comparative simulation experiments are conducted under scenarios with different perception-error intensities and different initial friendly and enemy situations. The results show that GPPO effectively produces advantageous maneuvering decisions in all tested WVR air combat scenarios with perception errors, and that models trained with GPPO and the quantified-advantage RS method converge significantly faster and make significantly better maneuvering decisions than baseline reinforcement learning algorithms, effectively improving the maneuvering decision-making capability of unmanned aerial vehicles under situational perception errors.
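As a rough illustration of the pipeline the abstract describes, the sketch below (Python/PyTorch, with assumed names such as GRUActorCritic, perceived_state and shaped_reward, and illustrative parameters w_angle, w_range and noise_std that are not from the paper) shows how a GRU can fuse a short sequence of noisy situation observations before the PPO actor and value heads, how perception error can be injected as zero-mean noise of configurable intensity, and how a quantified-advantage shaping term can densify the reward. It is a minimal sketch under these assumptions, not the authors' implementation.

from typing import Optional, Tuple

import torch
import torch.nn as nn


class GRUActorCritic(nn.Module):
    """Actor-critic network that fuses a short history of (noisy) situation
    observations with a GRU before the PPO policy and value heads."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.actor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, action_dim),
        )
        self.critic = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs_seq: torch.Tensor,
                h0: Optional[torch.Tensor] = None
                ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # obs_seq: (batch, seq_len, obs_dim) sequence of perceived situations
        out, h_n = self.gru(obs_seq, h0)
        feat = out[:, -1, :]  # hidden feature at the latest time step
        return self.actor(feat), self.critic(feat), h_n


def perceived_state(true_state: torch.Tensor, noise_std: float) -> torch.Tensor:
    """Inject situational perception error as zero-mean Gaussian noise whose
    standard deviation sets the error intensity (illustrative only)."""
    return true_state + noise_std * torch.randn_like(true_state)


def shaped_reward(sparse_reward: float, angle_adv: float, range_adv: float,
                  w_angle: float = 0.5, w_range: float = 0.5) -> float:
    """Hypothetical quantified-advantage reward shaping: add a weighted sum of
    angular and range advantage terms, each scaled to [-1, 1], to the sparse
    engagement reward to densify the training signal."""
    return sparse_reward + w_angle * angle_adv + w_range * range_adv

Only the GRU output at the latest time step feeds the policy and value heads here, so the recurrent state carries the preceding situation information; the hidden size, noise model and shaping weights are placeholders to be tuned per scenario.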

Keywords: deep reinforcement learning; within-visual-range air combat; maneuvering decision-making; perception error; reward shaping; unmanned aerial vehicle

CLC Numbers: E926 (Military Science: Military Equipment Studies); TP181 (Weapon Science and Technology: Weapon Systems and Operations Engineering); V217 (Automation and Computer Technology: Control Theory and Control Engineering)

 
