Upper Confidence Bound Exploration with Fast Convergence


Authors: AO Tian-yu; LIU Quan[1,2,3,4]

Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; [2] Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China; [3] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; [4] Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China

Source: Computer Science (《计算机科学》), 2022, Issue 1, pp. 298-305 (8 pages)

Funding: National Natural Science Foundation of China (61772355, 61702055, 61502323, 61502329); Major Program of Natural Science Research of Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004); Funding Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Applied Basic Research Program, Industrial Part (SYG201422); Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Deep reinforcement learning (DRL) has achieved excellent results in control tasks with large state spaces, and exploration remains a research hotspot in this field. Existing exploration algorithms suffer from blind exploration and slow learning. To address these problems, an upper confidence bound exploration with fast convergence (FAST-UCB) method is proposed. FAST-UCB uses the UCB algorithm to explore large state spaces, improving exploration efficiency. To alleviate Q-value overestimation and balance exploration against exploitation, a Q-value clipping technique is added. Then, to balance the bias and variance of the algorithm and let the agent learn quickly, a long short-term memory (LSTM) unit is added to the network model, and an improved mixed Monte Carlo (MMC) method is used to compute the network error. Finally, FAST-UCB is applied to the deep Q network (DQN) and compared with the ε-greedy and UCB algorithms in control environments to verify its effectiveness, and compared with Noisy-Network exploration, bootstrapped exploration, the asynchronous advantage actor-critic (A3C) algorithm, and the proximal policy optimization (PPO) algorithm in the Atari 2600 environment to verify its generalization. Experimental results show that the FAST-UCB algorithm achieves excellent results in both kinds of environments.
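The abstract names three mechanisms that a short sketch can make concrete: UCB action selection, Q-value clipping, and the mixed Monte Carlo (MMC) error. The following is only a minimal Python sketch under stated assumptions, not the authors' implementation: a tabular table stands in for the deep Q network (the LSTM component is omitted), and the class name FastUCBSketch, the constants c, beta, gamma, and the upper clip q_clip are illustrative choices; the paper's exact clipping scheme may differ.

import math
from collections import defaultdict

class FastUCBSketch:
    """Minimal sketch: UCB exploration with a clipped, MMC-blended target.

    A lookup table stands in for the paper's deep Q network; all
    hyperparameter values are assumptions for illustration only.
    """

    def __init__(self, n_actions, c=2.0, beta=0.1, gamma=0.99, q_clip=100.0):
        self.n_actions = n_actions
        self.c = c            # UCB exploration coefficient (assumed value)
        self.beta = beta      # MMC mixing weight between TD and MC (assumed)
        self.gamma = gamma    # discount factor
        self.q_clip = q_clip  # upper bound for clipping bootstrapped Q values
        self.q = defaultdict(lambda: [0.0] * n_actions)    # Q(s, a) estimates
        self.n_s = defaultdict(int)                        # visit counts N(s)
        self.n_sa = defaultdict(lambda: [0] * n_actions)   # pair counts N(s, a)

    def select_action(self, s):
        # UCB rule: argmax_a  Q(s, a) + c * sqrt(ln N(s) / (N(s, a) + 1)).
        # The +1 in the denominator avoids division by zero for unvisited actions.
        self.n_s[s] += 1
        def score(a):
            bonus = self.c * math.sqrt(math.log(self.n_s[s]) / (self.n_sa[s][a] + 1))
            return self.q[s][a] + bonus
        a = max(range(self.n_actions), key=score)
        self.n_sa[s][a] += 1
        return a

    def mmc_target(self, r, s_next, mc_return):
        # Clip the bootstrapped Q value to curb overestimation, then blend
        # the one-step TD target with the episode's Monte Carlo return.
        q_boot = min(max(self.q[s_next]), self.q_clip)
        td_target = r + self.gamma * q_boot
        return (1.0 - self.beta) * td_target + self.beta * mc_return

In the paper, the squared difference between such a target and the current network output would drive the gradient update; in this tabular stand-in it would simply move the table entry q[s][a] toward mmc_target.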

Keywords: exploration; upper confidence bound; long short-term memory; mixed Monte Carlo; Q-value clipping

Classification: TP181 (Automation and Computer Technology: Control Theory and Control Engineering)

 
