Unified NDP method based on TD(0) learning for both average and discounted Markov decision processes (Cited by: 5)


Authors: TANG Hao[1], ZHOU Lei[1], YUAN Ji-bin[1]

Affiliation: [1] School of Computer and Information, Hefei University of Technology, Hefei 230009, Anhui, China

Source: Control Theory & Applications, 2006, No. 2, pp. 292-296 (5 pages)

Funding: National Natural Science Foundation of China (60404009); Natural Science Foundation of Anhui Province (050420303); Young and Middle-aged Scientific and Technological Innovation Team Program of Hefei University of Technology

Abstract: Motivated by the needs of practical large-scale Markov systems, this paper studies simulation-based learning optimization for Markov decision processes (MDPs). Starting from the definition of performance potentials, a unified temporal-difference formula is established for both the average and discounted performance criteria. A neural network is used to represent the estimated potentials, and the parameterized TD(0) learning formula and algorithm are derived for approximate policy evaluation. Based on the approximated potentials, approximate policy iteration then yields a unified neuro-dynamic programming (NDP) optimization method for the two criteria. The results also apply to semi-Markov decision processes. A numerical example shows that the proposed neuro-policy iteration algorithm works under both criteria and verifies that the average-criterion problem is the limiting case of the discounted one as the discount factor tends to zero.
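The abstract describes the method only at a high level. As a concrete illustration, the sketch below shows how a TD(0) update of performance potentials with a simple one-layer (linear-in-parameters) approximator can be written so that a single update rule covers both criteria. The specific TD-error form, the discount parameterization `alpha_rate`, and all function and variable names are assumptions made for this sketch; they are not taken from the paper.

```python
import numpy as np

# Minimal sketch (not the paper's exact algorithm): TD(0) learning of
# performance potentials g(x) ~ w^T phi(x) with a one-layer approximator.
# Assumed unified TD error: delta = f(x) - eta + beta * g(x') - g(x),
# with beta = 1 / (1 + alpha_rate); alpha_rate -> 0 recovers the
# average-criterion update, matching the limit discussed in the abstract.

def phi(state, n_states):
    """One-hot feature vector for a small finite state space (illustrative)."""
    v = np.zeros(n_states)
    v[state] = 1.0
    return v

def td0_potentials(P, f, steps=50000, alpha_rate=0.0,
                   lr=0.01, eta_lr=0.01, seed=0):
    """Estimate potentials along one simulated trajectory of a fixed policy.

    P : (n, n) transition matrix under the policy being evaluated
    f : (n,)  one-step reward vector
    alpha_rate : assumed discount-rate parameter; 0 gives the average criterion
    """
    rng = np.random.default_rng(seed)
    n = len(f)
    beta = 1.0 / (1.0 + alpha_rate)      # assumed discount parameterization
    w = np.zeros(n)                      # weights of the potential approximator
    eta = 0.0                            # running estimate of the average reward
    x = 0
    for _ in range(steps):
        x_next = rng.choice(n, p=P[x])   # simulate one transition
        r = f[x]
        eta += eta_lr * (r - eta)        # track the average reward
        delta = r - eta + beta * (w @ phi(x_next, n)) - w @ phi(x, n)
        w += lr * delta * phi(x, n)      # TD(0) parameter update
        x = x_next
    return w, eta
```

In an approximate policy-iteration loop, this evaluation step would alternate with a greedy improvement step computed from the estimated potentials; the same loop then serves the average case (alpha_rate near 0) and the discounted case (alpha_rate > 0).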

Keywords: Markov decision processes; performance potentials; TD(0) learning; neuro-dynamic programming

Classification: TP18 [Automation and Computer Technology / Control Theory and Control Engineering]

 
