Authors: ZHANG Jun-yu (张俊玉) [1]; WU Yi-ting (吴怡婷); XIA Li (夏俐); CAO Xi-ren (曹希仁) [3]
Affiliations: [1] School of Mathematics, Sun Yat-Sen University, Guangzhou, Guangdong 510275, China; [2] School of Business, Sun Yat-Sen University, Guangzhou, Guangdong 510275, China; [3] Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong, China
Source: Control Theory & Applications (《控制理论与应用》), 2021, No. 11, pp. 1707-1716 (10 pages)
Funding: Supported by the National Natural Science Foundation of China (61673019, 61773411, 11931018, 62073346); the Guangdong Province Key Laboratory of Computational Science at the Sun Yat-sen University (2020B1212060032); the Guangdong Basic and Applied Basic Research Foundation (2021A1515010057, 2021A1515011984).
Abstract: For a Markov decision process (MDP) with a countable state space under the long-run average criterion, an optimal (stationary) policy may not exist. In this paper, we study optimal policies that satisfy an optimality inequality in a countable-state MDP under the long-run average criterion. Unlike the vanishing discount approach, we use the discrete Dynkin's formula to derive the main results. We first give the Poisson equation of an ergodic Markov chain and two instructive examples of null recurrent Markov chains, and prove the existence of optimal policies satisfying two optimality inequalities with opposite directions. Then, using two comparison lemmas and the performance difference formula, we prove the existence of optimal policies for positive recurrent chains and multi-chains, and extend the result to other situations. In particular, several application examples illustrate the performance-sensitive nature of the long-run average criterion. Our results complete the theory of the optimality inequality for countable-state MDPs under the average criterion.
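Keywords: Markov decision process; average criterion; countable state space; Dynkin's formula; Poisson equation; performance sensitivity
Classification: O211.62 [Science: Probability Theory and Mathematical Statistics]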
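Note on notation: for orientation, the objects named in the abstract can be written in standard textbook form. This is only a sketch for an ergodic chain, not the paper's countable-state statements. All symbols are assumptions introduced here rather than taken from the paper: transition matrix $P$, reward vector $f$, stationary distribution $\pi$, average reward $\eta$, potential $g$, and all-ones vector $e$; primed quantities belong to a second policy.

```latex
% Poisson equation of an ergodic chain (assumed notation)
(I - P)\,g + \eta\,e = f, \qquad \eta = \pi f, \quad \pi P = \pi, \quad \pi e = 1 .

% Performance difference formula between two policies (P, f) and (P', f'):
% left-multiply the Poisson equation by \pi' and use \pi' P' = \pi', \eta' = \pi' f'
\eta' - \eta = \pi' \bigl[ (f' - f) + (P' - P)\,g \bigr] .

% Average-reward optimality equation; an "optimality inequality" relaxes
% the equality to \le or \ge (the two opposite directions in the abstract)
\eta + g(i) = \max_{a} \Bigl\{ f(i,a) + \sum_{j} p(j \mid i,a)\, g(j) \Bigr\} .
```

The second line follows from the first in one step, which is why the potential $g$ of a single policy is enough to compare it with any other policy.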