Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; [2] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education (Jilin University), Changchun 130012, China
Source: Chinese Journal of Computers (计算机学报), 2018, No. 1, pp. 112-131 (20 pages)
Funding: National Natural Science Foundation of China (61272005, 61303108, 61373094, 61472262, 61502323, 61502329); Natural Science Foundation of Jiangsu Province (BK2012616); Natural Science Research Project of Jiangsu Higher Education Institutions (13KJB520020); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04); Suzhou Applied Basic Research Program (SYG201422); Provincial Key Laboratory Fund of Soochow University (KJS1524); China Scholarship Council (201606920013); Natural Science Foundation of Zhejiang Province (LY16F010019)
Abstract: Table-driven algorithms are an important class of methods for solving reinforcement learning problems, but because of the curse of dimensionality they cannot be applied directly to problems with continuous state spaces. Two main approaches have been proposed to deal with the curse of dimensionality: discretization of the continuous state space and function approximation. Compared with function approximation, table-driven methods built on state-space discretization have a straightforward principle, a simple program structure, and lightweight computation. The key to such methods is to find a suitable discretization mechanism that balances computational cost against accuracy and ensures that numerical measures defined on the discrete abstract state space, such as the V-value and Q-value functions, can evaluate policies and compute the optimal policy π* of the original reinforcement learning problem with sufficient accuracy. This paper proposes an adaptive state-space discretization method based on the convex polyhedra abstract domain and realizes an adaptive polyhedra-domain-based Q(λ) reinforcement learning algorithm (Adaptive Polyhedra Domain based Q(λ), APDQ(λ)). A convex polyhedron is a representation of abstract states that is widely used in performance evaluation of stochastic systems and in the verification of numerical properties of programs. Through an abstraction function, the method maps the concrete state space onto an abstract state space over the polyhedra domain, thereby transforming the computation of an optimal policy over a continuous state space into the computation of an optimal policy over a finite and tractable abstract state space. Based on the sample information associated with each abstract state, several adaptive refinement mechanisms are designed, including BoxRefinement, LFRefinement, and MVLFRefinement. Guided by these mechanisms, the abstract state space is continuously refined, which in turn optimizes the discretization of the concrete state space and yields a statistical reward model consistent with the online sampled data. APDQ(λ) is implemented on top of the Parma Polyhedra Library (PPL) and the GNU Multiple Precision (GMP) arithmetic library, and case studies are conducted on two classic continuous-state reinforcement learning problems, Mountain Car (MC) and the acrobatic robot (Acrobot). The effects of various reinforcement learning parameters and refinement-related threshold parameters on the performance of APDQ(λ) are evaluated in detail, and the abstract state …
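To illustrate only the discretize-then-Q(λ) idea summarized in the abstract, the sketch below abstracts the continuous Mountain Car state space into a fixed grid of axis-aligned boxes (a special case of convex polyhedra) and runs tabular Watkins's Q(λ) over the abstract states. It is not the paper's APDQ(λ): it does not use PPL/GMP, does not perform BoxRefinement/LFRefinement/MVLFRefinement, and the grid resolution and hyper-parameters are assumptions chosen for the example.

# Minimal sketch: box-based abstraction of Mountain Car + tabular Watkins Q(lambda).
# Grid sizes and hyper-parameters below are illustrative assumptions, not the
# authors' settings; APDQ(lambda) instead refines general polyhedra adaptively.
import math
import random
import numpy as np

# Standard Mountain Car dynamics.
POS_MIN, POS_MAX = -1.2, 0.6
VEL_MIN, VEL_MAX = -0.07, 0.07
ACTIONS = (0, 1, 2)  # push left, no push, push right

def mc_step(pos, vel, a):
    vel = np.clip(vel + 0.001 * (a - 1) - 0.0025 * math.cos(3 * pos), VEL_MIN, VEL_MAX)
    pos = np.clip(pos + vel, POS_MIN, POS_MAX)
    if pos <= POS_MIN and vel < 0.0:
        vel = 0.0
    return pos, vel, -1.0, pos >= 0.5  # reward -1 per step, done at the goal

# Abstraction function: map a concrete state to the index of its box (abstract state).
N_POS, N_VEL = 40, 40  # fixed resolution here; APDQ(lambda) adapts the partition

def abstract(pos, vel):
    i = min(int((pos - POS_MIN) / (POS_MAX - POS_MIN) * N_POS), N_POS - 1)
    j = min(int((vel - VEL_MIN) / (VEL_MAX - VEL_MIN) * N_VEL), N_VEL - 1)
    return i * N_VEL + j

# Tabular Watkins Q(lambda) over the abstract state space.
ALPHA, GAMMA, LAM, EPS = 0.1, 1.0, 0.9, 0.1
Q = np.zeros((N_POS * N_VEL, len(ACTIONS)))

for episode in range(200):
    pos, vel = random.uniform(-0.6, -0.4), 0.0
    E = np.zeros_like(Q)  # eligibility traces
    s = abstract(pos, vel)
    for _ in range(5000):
        greedy = int(np.argmax(Q[s]))
        a = greedy if random.random() > EPS else random.choice(ACTIONS)
        pos, vel, r, done = mc_step(pos, vel, a)
        s2 = abstract(pos, vel)
        target = r if done else r + GAMMA * np.max(Q[s2])
        delta = target - Q[s, a]
        E[s, a] += 1.0             # accumulating trace
        Q += ALPHA * delta * E
        # Watkins's rule: decay traces after a greedy action, cut them otherwise.
        E *= GAMMA * LAM if a == greedy else 0.0
        s = s2
        if done:
            break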
Keywords: reinforcement learning; convex polyhedra abstract domain; continuous state space; Q(λ); adaptive refinement
Classification code: TP18 [Automation and Computer Technology — Control Theory and Control Engineering]