Option Discovery Method Based on Symbolic Knowledge


Authors: WANG Qidi; SHEN Liwei[1]; WU Tianyi (School of Computer Science, Fudan University, Shanghai 200438, China)

Affiliation: [1] School of Computer Science, Fudan University, Shanghai 200438, China

Source: Computer Science (《计算机科学》), 2025, No. 1, pp. 277-288 (12 pages)

Funding: Shanghai Municipal Major Project (2021SHZDZX0103).

Abstract: Option-based hierarchical policy learning is a main implementation approach in the field of hierarchical reinforcement learning. An option represents a temporal abstraction of specific actions, and a set of options can be combined in a hierarchical manner to tackle complex reinforcement learning tasks. For the goal of option discovery, existing research has focused on discovering meaningful options from unstructured demonstration trajectories using supervised or unsupervised methods. However, supervised option discovery requires manually decomposing the task and defining option policies, which imposes a substantial additional burden, while options discovered by unsupervised methods often lack rich semantics, limiting their subsequent reuse. This paper therefore proposes a symbolic-knowledge-based option discovery method that only requires a symbolic model of the environment. The acquired knowledge can guide option discovery for various tasks in that environment and assigns symbolic semantics to the discovered options, enabling their reuse when new tasks are executed. The method decomposes the option discovery process into two stages: trajectory segmentation and behavior cloning. Trajectory segmentation aims to extract semantically meaningful segments from demonstration trajectories; to this end, a segmentation model is trained on the demonstration trajectories, with symbolic knowledge used to define a reinforcement learning reward that evaluates segmentation accuracy. Behavior cloning then trains options in a supervised manner on the segmented data, aiming to make the options imitate the trajectory behaviors. The proposed method is evaluated with option discovery and option reuse experiments in multiple domain environments covering both discrete and continuous spaces. For the trajectory segmentation part of option discovery, the results show that the proposed method exceeds the baseline methods' segmentation accuracy by several percentage points in both discrete and continuous environments, with the margin rising to 20% on complex environment tasks. Additionally, the option reuse experiments demonstrate that, compared with the baseline methods, options enhanced with symbolic semantics train faster when reused in new tasks and still converge well on complex tasks that the baseline methods fail to complete.
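The two-stage pipeline described in the abstract can be sketched in a few lines. The sketch below is illustrative only and does not reproduce the paper's method: `symbolic_state`, the rule-based cut, and the majority-vote "policy" are all hypothetical stand-ins. The paper trains a segmentation model with a reinforcement learning reward derived from symbolic knowledge and learns options by behavior cloning; here segmentation is approximated by cutting wherever the symbolic abstraction of the state changes, so that only the segment/label interface between the two stages is shown.

```python
from collections import Counter, defaultdict

# Hypothetical symbolic abstraction: maps a raw state to a symbolic label.
# In the paper this comes from a hand-built symbolic model of the
# environment; a toy discretisation stands in for it here.
def symbolic_state(state: float) -> str:
    return "near_goal" if state >= 0.5 else "far_from_goal"

def segment_trajectory(states, actions):
    """Stage 1 (simplified): cut a demonstration wherever the symbolic
    label of the state changes, yielding (label, [(state, action), ...])
    segments. The paper instead trains a segmentation model whose RL
    reward, defined from symbolic knowledge, scores cut accuracy."""
    segments, current = [], []
    label = symbolic_state(states[0])
    for s, a in zip(states, actions):
        new_label = symbolic_state(s)
        if new_label != label and current:
            segments.append((label, current))
            current, label = [], new_label
        current.append((s, a))
    if current:
        segments.append((label, current))
    return segments

def behavior_clone(segments):
    """Stage 2 (simplified): fit one option per symbolic label by
    imitating the actions in its segments. A real implementation would
    train a neural policy on (state, action) pairs; here each 'option'
    is just the majority-vote action for its label."""
    votes = defaultdict(Counter)
    for label, pairs in segments:
        for _, a in pairs:
            votes[label][a] += 1
    return {label: c.most_common(1)[0][0] for label, c in votes.items()}

# Toy demonstration: move right while far from the goal, then up near it.
states = [0.1, 0.2, 0.6, 0.7]
actions = ["right", "right", "up", "up"]
segments = segment_trajectory(states, actions)
options = behavior_clone(segments)
print(options)  # {'far_from_goal': 'right', 'near_goal': 'up'}
```

Because each discovered option carries a symbolic label rather than an anonymous index, a planner working over the symbolic model can select and reuse it in a new task, which is the reuse benefit the abstract reports.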

Keywords: hierarchical reinforcement learning; learning from demonstration; option discovery; Markov decision process

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
