Affiliation: [1] School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou 221116, China
Source: Acta Automatica Sinica (《自动化学报》), 2011, No. 1, pp. 44-51 (8 pages)
Funding: Supported by the National Natural Science Foundation of China (60804022, 60974050, 61072094), the Program for New Century Excellent Talents in University of the Ministry of Education (NCET-08-0836), the Fok Ying-Tong Education Foundation for Young Teachers (121066), and the Natural Science Foundation of Jiangsu Province (BK2008126)
Abstract: In policy iteration reinforcement learning, the construction of basis functions is an important factor influencing the accuracy of action-value function approximation. To provide appropriate basis functions for action-value function approximation, a policy iteration reinforcement learning method based on geodesic Gaussian bases defined on a state-action graph is proposed. First, a state-action graph description of the Markov decision process is constructed according to an off-policy method. Second, geodesic Gaussian kernel functions are defined on the state-action graph, and a kernel sparsification approach based on approximate linear dependency is used to automatically select the centers of the geodesic Gaussian kernels. Finally, during policy evaluation, the geodesic Gaussian kernels based on the state-action graph are used to approximate the action-value function, and the policy is then improved based on the estimated value function. Simulation results on a 10 × 10 grid world show that, compared with policy iteration reinforcement learning methods based on either ordinary Gaussian bases or geodesic Gaussian bases defined on a state graph, the proposed method can accurately approximate an action-value function that is smooth yet discontinuous using fewer basis functions, and thus effectively obtain an optimal policy.
Keywords: state-action graph; geodesic Gaussian kernel; basis function; policy iteration; reinforcement learning
CLC number: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
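The pipeline summarized in the abstract (build a state-action graph, measure geodesic distances on it, define Gaussian kernels over those distances, and sparsify the kernel centers with an approximate-linear-dependency test) can be sketched as follows. This is a minimal illustration on a small grid world, not the authors' implementation: the graph construction, grid size, kernel width `sigma`, and ALD threshold `nu` are all assumed for demonstration.

```python
import numpy as np
from collections import deque
from itertools import product

# Illustrative sketch: geodesic Gaussian kernels on a state-action graph
# with ALD-based center selection. A 5x5 grid world stands in for the
# paper's 10x10 example; all parameter values are assumptions.

N = 5
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up

states = list(product(range(N), range(N)))
nodes = [(s, a) for s in states for a in range(len(ACTIONS))]  # (state, action) pairs
idx = {n: i for i, n in enumerate(nodes)}

def step(s, a):
    """Deterministic grid transition; invalid moves stay in place."""
    dx, dy = ACTIONS[a]
    nx, ny = s[0] + dx, s[1] + dy
    return (nx, ny) if 0 <= nx < N and 0 <= ny < N else s

# Edge (s, a) -> (s', a') whenever taking a in s leads to s'.
adj = {i: [] for i in range(len(nodes))}
for (s, a) in nodes:
    s2 = step(s, a)
    for a2 in range(len(ACTIONS)):
        adj[idx[(s, a)]].append(idx[(s2, a2)])

def geodesic(src):
    """BFS shortest-path (geodesic) distances on the state-action graph."""
    d = np.full(len(nodes), np.inf)
    d[src] = 0.0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if not np.isfinite(d[v]):
                d[v] = d[u] + 1.0
                q.append(v)
    return d

dist = np.vstack([geodesic(i) for i in range(len(nodes))])

sigma = 2.0                                    # kernel width (assumed)
K_full = np.exp(-dist**2 / (2.0 * sigma**2))   # geodesic Gaussian kernel matrix

# ALD sparsification: keep a node as a kernel center only if its kernel
# feature is not approximately linearly dependent on the chosen centers.
nu = 0.1                                       # ALD threshold (assumed)
centers = [0]
for t in range(1, len(nodes)):
    k_t = K_full[np.ix_(centers, [t])]         # kernels between t and centers
    K_cc = K_full[np.ix_(centers, centers)]
    c = np.linalg.solve(K_cc + 1e-8 * np.eye(len(centers)), k_t)
    delta = K_full[t, t] - (k_t.T @ c).item()  # ALD residual
    if delta > nu:
        centers.append(t)

print(f"{len(centers)} kernel centers selected from {len(nodes)} state-action nodes")
```

Because distances are measured along the graph rather than in Euclidean state space, the kernels respect walls and other connectivity breaks, which is what lets them track the discontinuities of the action-value function; the ALD test then keeps only as many centers as are needed to span the kernel features.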