Control of lunar landers based on secure reinforcement learning

Authors: YANG Min; LIU Guanjun[1]; ZHOU Ziyuan (Department of Computer Science and Technology, Tongji University, Shanghai 201804, China)

Affiliation: [1] Department of Computer Science and Technology, Tongji University, Shanghai 201804, China

Source: Acta Aeronautica et Astronautica Sinica, 2025, No. 3, pp. 118-131 (14 pages)

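Funding: National Natural Science Foundation of China (62172299, 62032019); Open Fund of the Space Optoelectronic Measurement and Perception Laboratory, Beijing Institute of Control Engineering (LabSOMP-2023-03); Fundamental Research Funds for the Central Universities (2023-4-YB-05); Shanghai Science and Technology Innovation Action Plan (22511105500).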

Abstract: In lunar landing missions, the lander must perform precise operations in extreme environments and often faces communication delays, which severely limit the real-time operational capability of ground control. To address these challenges, this study proposes a safety enhancement framework for Deep Reinforcement Learning (DRL) based on the Semi-Markov Decision Process (SMDP), aiming to improve the operational safety of autonomous spacecraft landing. To compress the state space while preserving the key characteristics of the decision-making process, the framework compresses the Markov Decision Process (MDP) of historical trajectories into an SMDP and constructs an abstract SMDP state transition graph from the compressed trajectory data. It then identifies key state-action pairs that carry potential risk and applies real-time monitoring and intervention, which improves the safety of autonomous landing. A reverse breadth-first search is used to find the state-action pairs that have a decisive impact on the task outcome, and a purpose-built state-action monitor adjusts the model in real time. Experimental results show that, without adding sensors or significantly changing the existing system configuration, the framework increases the mission success rate of the lunar lander in a simulated environment by up to 22% on pre-trained Deep Q-Network (DQN), Dueling DQN, and DDQN models, and improves safety by up to 42% under the preset safety evaluation criteria. Simulation results in a virtual environment further demonstrate the framework's practical application potential in complex space missions such as lunar landing, where it can improve operational safety and efficiency.
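
Illustrative sketch (not the authors' implementation): the pipeline named in the abstract, namely trajectory compression, an abstract transition graph, reverse breadth-first search, and a state-action monitor, might look roughly like the Python below. Everything here is an assumption made for illustration: the `abstract` discretization function, the `FAIL` terminal label, the rule that consecutive steps within one abstract state are merged, the `max_depth` cutoff, and the fallback of picking the next-best Q-value. The abstract does not specify any of these details.

```python
from collections import defaultdict, deque

FAIL = "FAIL"  # hypothetical label for a failed (crashed) terminal state


def compress(trajectory, abstract):
    """Merge consecutive MDP steps that stay inside one abstract state.

    trajectory: list of (state, action, next_state) tuples.
    Returns SMDP transitions (abs_state, action, abs_next, duration); the
    recorded action is the one taken when the abstract state was left.
    Trailing steps that never leave their abstract state are dropped.
    """
    smdp, duration = [], 0
    for state, action, next_state in trajectory:
        s, s_next = abstract(state), abstract(next_state)
        duration += 1
        if s_next != s:  # abstract state changed: close the segment
            smdp.append((s, action, s_next, duration))
            duration = 0
    return smdp


def build_graph(smdp_trajectories):
    """Store the transition graph as predecessor sets for backward search."""
    preds = defaultdict(set)  # abs_next -> {(abs_state, action), ...}
    for traj in smdp_trajectories:
        for s, a, s_next, _ in traj:
            preds[s_next].add((s, a))
    return preds


def risky_pairs(preds, max_depth):
    """Reverse BFS from FAIL, collecting pairs within max_depth transitions."""
    risky, seen = set(), {FAIL}
    queue = deque([(FAIL, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for s, a in preds.get(node, ()):
            risky.add((s, a))
            if s not in seen:
                seen.add(s)
                queue.append((s, depth + 1))
    return risky


def monitored_action(abs_state, q_values, risky):
    """Return the highest-valued action that is not flagged as risky."""
    ranked = sorted(range(len(q_values)), key=lambda a: q_values[a], reverse=True)
    for a in ranked:
        if (abs_state, a) not in risky:
            return a
    return ranked[0]  # every action is flagged: fall back to the greedy one
```

A small `max_depth` keeps the flagged set close to the failure; a larger value flags every pair on any observed path to failure, which is likely too conservative without an additional filter on how often each pair actually ends in failure.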

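Keywords: deep reinforcement learning; autonomous landing; abstract SMDP state transition graph; safety enhancement; real-time monitoring; reverse breadth-first search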

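Classification: V448 [Aerospace Science and Technology: Flight Vehicle Design]; TP391 [Automation and Computer Technology: Computer Application Technology]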

 
