Deep Double Q-Network Based on Linear Dynamic Frame Skip (Cited by: 2)

Authors: CHEN Song, ZHANG Xiao-Fang [1,2], ZHANG Zong-Zhang, LIU Quan [1,3], WU Jin-Jin, YAN Yan

Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; [2] State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023; [3] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012

Source: Chinese Journal of Computers (《计算机学报》), 2019, No. 11, pp. 2561-2573 (13 pages)

Funding: National Natural Science Foundation of China (61472262, 61502329, 61772355, 61876119); Natural Science Foundation of Jiangsu Province (BK20181432); Foundation of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Key Industry Technology Innovation - Prospective Application Research Project (SYG201807)

Abstract: The Deep Q-Network model performs well on decision and control tasks that require perceiving high-dimensional raw inputs. However, Deep Q-Network and its improved variants adopt a static frame skip: the action output by the network is repeated for a fixed number of frames regardless of the current state. Although the Dynamic Frame Skip Deep Q-Network uses a dynamic frame skip rate, it doubles the number of nodes in the network's output layer, with the frame skip rate fixed at 4 or 20; such a setting increases the computational cost of the network and can cause bad actions to be executed many times, which hurts learning efficiency. In addition, an important technique in Deep Q-Network is experience replay. Uniform sampling from the replay memory ignores the relative importance of samples; prioritized experience replay improves on uniform sampling by increasing the sampling rate of important samples, but existing work uses only a sample's temporal difference error as the priority criterion, while other factors may also affect a sample's priority. To address these two problems, this paper proposes a Deep Double Q-Network based on Linear Dynamic Frame Skip and Improved Prioritized Experience Replay (LDF-IPER-DDQN for short). The frame skip rate becomes a dynamically learnable parameter that grows linearly with the magnitude of the network's output Q values, so the agent determines the number of times an action is repeated from the current state and action: the action with the largest Q value is assigned the maximum frame skip rate, and the action with the smallest Q value the minimum. Furthermore, the frame skip rate of each action stored in the replay memory and the sample's temporal difference error jointly determine the sample's priority. Experiments on Atari 2600 games show that the proposed algorithm outperforms the traditional dynamic frame skip and prioritized experience replay algorithms.
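
The following minimal Python sketch (not the authors' code; skip_min, skip_max, beta, and the exact combination formula are illustrative assumptions) shows the two mechanisms described in the abstract: a frame skip rate that grows linearly with each action's Q value, and a replay priority computed jointly from a sample's temporal difference error and the frame skip rate of its stored action.

    import numpy as np

    def linear_frame_skip(q_values, skip_min=1, skip_max=20):
        """Map Q values to frame skip rates that grow linearly with Q:
        the largest Q value gets skip_max, the smallest gets skip_min.
        skip_min/skip_max are illustrative bounds, not values from the paper."""
        q = np.asarray(q_values, dtype=np.float64)
        spread = q.max() - q.min()
        if spread == 0:                              # all actions equally valued
            return np.full(q.shape, skip_min, dtype=np.int64)
        scaled = (q - q.min()) / spread              # normalise Q values to [0, 1]
        return np.round(skip_min + scaled * (skip_max - skip_min)).astype(np.int64)

    def combined_priority(td_error, frame_skip, alpha=0.6, beta=0.5, eps=1e-6):
        """One plausible (hypothetical) priority that combines the sample's
        TD error with the frame skip rate of the action it stored."""
        return (abs(td_error) + beta * frame_skip + eps) ** alpha

    # Toy usage: act greedily, repeat the action for its skip rate,
    # then store the transition with the combined priority.
    q_values = [0.2, 1.5, 0.7]                       # pretend network outputs
    skips = linear_frame_skip(q_values)              # e.g. [ 1, 20,  8]
    action = int(np.argmax(q_values))
    repeat = int(skips[action])                      # greedy action -> largest skip
    priority = combined_priority(td_error=0.8, frame_skip=repeat)
    print(action, repeat, priority)

In this sketch the greedy action, having the largest Q value, is repeated for the most frames, matching the linear mapping the abstract describes; the priority formula only illustrates that both the TD error and the stored frame skip rate enter the priority.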

Keywords: deep reinforcement learning; deep Q-network; dynamic frame skip; prioritized experience replay

CLC Number: TP18 [Automation and Computer Technology - Control Theory and Control Engineering]