Affiliation: [1] School of Computer and Information, Hefei University of Technology, Hefei 230009, Anhui, China
Source: Control Theory & Applications, 2006, No. 2, pp. 292-296 (5 pages)
Funding: National Natural Science Foundation of China (60404009); Anhui Provincial Natural Science Foundation (050420303); Hefei University of Technology Program for Young and Middle-aged Scientific and Technological Innovation Groups
Abstract: Motivated by the needs of practical large-scale Markov systems, this paper studies simulation-based learning optimization for Markov decision processes (MDPs). Starting from the definition of performance potentials, a temporal-difference formula that is unified for the average and discounted performance criteria is established, and a neural network is used to represent the potential estimates; the parameterized TD(0) learning formula and algorithm are then derived for approximate policy evaluation. Based on the approximate potential values, approximate policy iteration yields a neuro-dynamic programming (NDP) optimization method that is unified for both criteria. The results also apply to semi-Markov decision processes. A numerical example illustrates that the proposed neuro-policy iteration algorithm works under both criteria and verifies that the average-criterion problem is the limiting case of the discounted problem as the discount factor tends to zero.
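As a rough illustration of the kind of method the abstract describes, the sketch below estimates performance potentials for a fixed policy by TD(0) with a small neural approximator under the average-reward criterion. The transition matrix P, reward vector r, the tanh network, and all step sizes are illustrative assumptions of this sketch; the paper's unified temporal-difference formula and neuro-policy iteration procedure are not reproduced here.

import numpy as np

# Illustrative sketch (not the authors' exact algorithm): average-reward TD(0)
# estimation of performance potentials g(x) for a simulated Markov chain (P, r)
# under a fixed policy, using a tiny one-hidden-layer tanh network.
# TD error used here: delta_t = r(x_t) - eta_hat + g(x_{t+1}) - g(x_t).

def one_hot(x, n):
    v = np.zeros(n)
    v[x] = 1.0
    return v

class PotentialNet:
    """Small neural approximator g(x; w) for the performance potential."""
    def __init__(self, n_states, n_hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_states))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(scale=0.1, size=n_hidden)
        self.b2 = 0.0

    def value_and_grad(self, x_vec):
        h = np.tanh(self.W1 @ x_vec + self.b1)
        g = self.w2 @ h + self.b2
        dh = (1.0 - h ** 2) * self.w2          # backprop through the tanh layer
        grads = (np.outer(dh, x_vec), dh, h, 1.0)
        return g, grads

    def update(self, step, grads):
        dW1, db1, dw2, db2 = grads
        self.W1 += step * dW1
        self.b1 += step * db1
        self.w2 += step * dw2
        self.b2 += step * db2

def td0_potentials(P, r, n_steps=50000, lr=0.01, eta_lr=0.001, seed=0):
    """Simulate the chain and run semi-gradient TD(0) on the potential network."""
    rng = np.random.default_rng(seed)
    n = len(r)
    net, eta, x = PotentialNet(n), 0.0, 0
    for _ in range(n_steps):
        x_next = rng.choice(n, p=P[x])
        g_x, grads = net.value_and_grad(one_hot(x, n))
        g_next, _ = net.value_and_grad(one_hot(x_next, n))
        delta = r[x] - eta + g_next - g_x      # average-reward TD(0) error
        net.update(lr * delta, grads)          # semi-gradient parameter update
        eta += eta_lr * (r[x] - eta)           # running average-reward estimate
        x = x_next
    return net, eta

# Example usage on a hypothetical 3-state chain (illustrative numbers only):
# P = np.array([[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]])
# r = np.array([1.0, 0.0, 2.0])
# net, eta = td0_potentials(P, r)

The discounted-criterion variant would modify the TD error accordingly; the paper gives a single formula covering both cases, with the average case recovered as the discount factor goes to zero.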
Keywords: Markov decision process; performance potential; TD(0) learning; neuro-dynamic programming
CLC number: TP18 [Automation and Computer Technology / Control Theory and Control Engineering]