Trust region policy optimization guidance algorithm for intercepting maneuvering target  (Cited by: 2)


Authors: CHEN Wenxue; GAO Changsheng [1]; JING Wuxing [1] (School of Astronautics, Harbin Institute of Technology, Harbin 150001, China)

Affiliation: [1] School of Astronautics, Harbin Institute of Technology, Harbin 150001, China

Source: Acta Aeronautica et Astronautica Sinica, 2023, No. 11, pp. 277-295 (19 pages)

Funding: National Natural Science Foundation of China (12072090).

Abstract: Hypersonic vehicles in near space are fast and highly maneuverable. To improve the accuracy, robustness, and intelligence of guidance against targets with different initial states and different maneuver modes, this paper proposes a deep reinforcement learning guidance algorithm based on the Trust Region Policy Optimization (TRPO) algorithm. The guidance algorithm consists of two policy (action) networks and one critic network, and it maps the state of the relative motion system between the near-space target and the interceptor directly to the interceptor's guidance command in an end-to-end manner. During training, the continuous action space and the state space are chosen with care, and the reward function is built by weighing energy consumption, relative distance, and other factors in order to speed up convergence. The trained agent model is then tested on interception in different task scenarios. The simulation results show that, compared with the traditional Proportional Navigation guidance law (PN) and the Improved Proportional Navigation guidance law (IPN), the proposed algorithm achieves smaller miss distances and more stable, more robust interception in both learned and previously unseen scenarios, and that it can run on computers with a variety of configurations.
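For readers unfamiliar with the PN baseline and with reward shaping, the following is a minimal Python sketch, not the paper's implementation. `pn_command` uses the textbook planar form of proportional navigation, a = N · Vc · λ̇, where the navigation gain `nav_gain = 3.0` is an assumed typical value. `shaped_reward` is a purely hypothetical illustration of weighing relative distance against energy consumption; its form and the weights `k_range` and `k_energy` are assumptions and are not taken from the paper.

```python
import math


def los_rate_and_closing_speed(rel_pos, rel_vel):
    """Planar line-of-sight (LOS) rate and closing speed.

    rel_pos, rel_vel: (x, y) tuples, target minus interceptor.
    """
    rx, ry = rel_pos
    vx, vy = rel_vel
    r2 = rx * rx + ry * ry
    r = math.sqrt(r2)
    los_rate = (rx * vy - ry * vx) / r2       # d(lambda)/dt in rad/s
    closing_speed = -(rx * vx + ry * vy) / r  # -dr/dt in m/s
    return los_rate, closing_speed


def pn_command(rel_pos, rel_vel, nav_gain=3.0):
    """Textbook proportional navigation: a = N * Vc * lambda_dot.

    nav_gain is an assumed value, not one reported in the paper.
    """
    los_rate, closing_speed = los_rate_and_closing_speed(rel_pos, rel_vel)
    return nav_gain * closing_speed * los_rate


def shaped_reward(r_prev, r_now, accel_cmd, k_range=1.0, k_energy=1e-4):
    """Hypothetical per-step reward (NOT the paper's reward function).

    Rewards a decrease in relative distance and penalizes control
    effort as a stand-in for energy consumption.
    """
    return k_range * (r_prev - r_now) - k_energy * accel_cmd ** 2


if __name__ == "__main__":
    # Target 10 km ahead, closing at 1000 m/s with 50 m/s of crossing velocity.
    print(pn_command((10000.0, 0.0), (-1000.0, 50.0)))
    print(shaped_reward(1000.0, 900.0, 10.0))
```

On a perfect collision course the LOS rate is zero, so the PN command vanishes; a learned policy such as the TRPO agent described above would instead output the command directly from the relative-motion state.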

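Keywords: deep reinforcement learning; trust region policy optimization; near-space interception; missile terminal guidance; maneuvering target; Markov process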

Classification number: V488.133 [Aerospace Science and Technology]

 
