不确定性环境下基于进化算法的强化学习 (Cited by: 12)

Evolutionary Algorithm Based Reinforcement Learning in Uncertain Environments


Authors: 刘海涛[1], 洪炳熔[1], 朴松昊[1], 王雪梅[2]

Affiliations: [1] School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, Heilongjiang, China; [2] School of Automation, Harbin University of Science and Technology, Harbin 150080, Heilongjiang, China

Source: Acta Electronica Sinica (《电子学报》), 2006, No. 7, pp. 1356-1360 (5 pages)

Fund: Supported by the National High-Tech R&D Program of China (863 Program) (No. 2002AA735041)

Abstract: Uncertainty and hidden state are two major obstacles for current reinforcement learning (RL) methods. This paper proposes a novel approximate algorithm, Memetic-Algorithm-based Q-Learning (MA-Q-Learning), for finding near-optimal policies for POMDP problems with such uncertainty. Policies are evolved with a memetic algorithm, while an improved Q-learning procedure supplies predicted rewards that serve as the fitness values of the evolved policies. To address hidden state, the agent's most recent deterministic finite-step history is combined with the belief state, a probability distribution over all possible states, to jointly decide the current optimal policy. Search efficiency is improved by a hybrid search method in which an adjustment factor maintains population diversity and guides a combined crossover operator together with the mutation operator. Experiments on POMDP benchmark problems show that the proposed algorithm outperforms other state-of-the-art approximate POMDP methods.
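The abstract mentions the belief state only in passing; it is the standard POMDP posterior over hidden states, updated by Bayes' rule after taking action $a$ and receiving observation $o$ (with transition model $T$ and observation model $O$):

$$ b'(s') = \frac{O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)} $$

To make the evolutionary loop concrete, below is a minimal Python sketch of the general pattern the abstract outlines: a population of history-window policies evolved by crossover and mutation, a rollout-based return standing in for the paper's Q-learning predictive reward, a diversity-driven adjustment factor steering crossover, and a greedy local-search (memetic) refinement. The ToyPOMDP environment, the 2-step history encoding, and every parameter value are illustrative assumptions, not the authors' published implementation.

```python
# Hypothetical sketch of memetic evolution + rollout fitness for a toy POMDP.
# All names and parameters are assumptions made for illustration only.
import random

N_OBS, N_ACTIONS = 2, 2          # assumed observation/action space sizes
POP_SIZE, GENERATIONS = 20, 50   # assumed evolutionary-loop settings

class ToyPOMDP:
    """Four hidden states, two aliased observations (obs = state % 2),
    so a memoryless policy cannot tell states {0,2} or {1,3} apart."""
    def reset(self):
        self.state = 0
    def step(self, action):
        reward = 1.0 if action == (self.state // 2) else 0.0
        self.state = (self.state + 1) % 4
        return self.state % 2, reward   # (observation, reward)

def random_policy():
    # Policy: table from a 2-step observation history to an action,
    # mirroring the paper's idea of a finite history window.
    return {(o1, o2): random.randrange(N_ACTIONS)
            for o1 in range(N_OBS) for o2 in range(N_OBS)}

def fitness(policy, env, steps=40, gamma=0.95):
    """Empirical discounted return; stands in for the paper's
    Q-learning-based predictive reward."""
    env.reset()
    hist, total = (0, 0), 0.0
    for t in range(steps):
        obs, r = env.step(policy[hist])
        total += (gamma ** t) * r
        hist = (hist[1], obs)
    return total

def crossover(p1, p2, adjust):
    # 'adjust' loosely plays the abstract's adjustment-factor role,
    # biasing how much material each parent contributes.
    return {k: (p1[k] if random.random() < adjust else p2[k]) for k in p1}

def mutate(policy, rate=0.1):
    for k in policy:
        if random.random() < rate:
            policy[k] = random.randrange(N_ACTIONS)
    return policy

def local_search(policy, env):
    # Memetic step: greedy one-entry hill climb, reverted if it hurts fitness.
    base = fitness(policy, env)
    k = random.choice(list(policy))
    old, policy[k] = policy[k], random.randrange(N_ACTIONS)
    if fitness(policy, env) < base:
        policy[k] = old
    return policy

env = ToyPOMDP()
pop = [random_policy() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    pop.sort(key=lambda p: fitness(p, env), reverse=True)
    elite = pop[:POP_SIZE // 2]
    # Crude diversity measure drives the adjustment factor.
    diversity = len({tuple(sorted(p.items())) for p in pop}) / POP_SIZE
    adjust = 0.5 + 0.3 * (1.0 - diversity)
    children = [mutate(crossover(random.choice(elite), random.choice(elite), adjust))
                for _ in range(POP_SIZE - len(elite))]
    pop = elite + [local_search(c, env) for c in children]

best = max(pop, key=lambda p: fitness(p, env))
print("best discounted return:", round(fitness(best, env), 3))
```

In this sketch the adjustment factor simply grows as population diversity shrinks, biasing crossover toward a single parent; the paper's actual operator combines multiple kinds of crossover and mutation, which the sketch does not attempt to reproduce.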

Keywords: partially observable Markov decision process (POMDP); Q-learning; memetic algorithm; belief state; hidden state

Classification: TP319 [Automation and Computer Technology: Computer Software and Theory]

 
