Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; [2] Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
Source: Chinese Journal of Computers (《计算机学报》), 2018, No. 1, pp. 1-27 (27 pages)
Funding: Supported by the National Natural Science Foundation of China (61472262, 61303108, 61373094, 61502323, 61502329, 61772355); the Suzhou Applied Basic Research Program, Industrial Part (SYG201422, SYG201308); the Natural Science Foundation of Jiangsu (BK2012616); the High School Natural Foundation of Jiangsu (13KJB520020, 16KJB520041); and the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04)
Abstract: Deep reinforcement learning (DRL) is a new research hotspot in the artificial intelligence community. In a general-purpose form, DRL integrates the perceptual capability of deep learning (DL) with the decision-making capability of reinforcement learning (RL), and achieves direct control from raw inputs to outputs through end-to-end learning. Since it was proposed, DRL has made substantial breakthroughs in a variety of tasks that require both perception of high-dimensional raw inputs and policy control. In this paper, we systematically describe three main categories of DRL methods. First, we summarize value-based DRL methods. The core idea behind them is to approximate the value function with deep neural networks, which have strong perceptual capability. We introduce an epoch-making value-based DRL method called Deep Q-Network (DQN) and its variants. These variants fall into two categories: improvements to the training algorithm and improvements to the model architecture. The first category includes Double Deep Q-Network (DDQN), DQN based on the advantage learning technique, and DDQN with proportional prioritization. The second includes Deep Recurrent Q-Network (DRQN) and a method based on the Dueling Network architecture. In general, value-based DRL methods are good at dealing with large-scale problems with discrete action spaces. We then summarize policy-based DRL methods. Their core idea is to parameterize policies with deep neural networks and optimize them directly. In this part, we first highlight some pure policy gradient methods, then focus on a series of policy-based DRL algorithms that use the actor-critic framework, e.g., Deep Deterministic Policy Gradient (DDPG), followed by an effective method named Asynchronous Advantage Actor-Critic (A3C), which dramatically reduces training time. Compared with value-based methods, policy-based DRL methods have a wider range of successful applications in complex problems. Third, we summarize DRL methods based on search and supervision. Beyond these three categories, we survey several frontier research directions in DRL, including hierarchical DRL, multi-task transfer DRL, multi-agent DRL, and DRL based on memory and reasoning. Finally, we summarize successful applications of DRL in several domains and discuss future development trends.
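For concreteness, a minimal sketch of the objective behind the value-based methods discussed above, in standard notation rather than the paper's own (here \theta denotes the online network parameters, \theta^{-} the periodically updated target network, and D the experience replay buffer):

    L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right]

Double DQN decouples action selection from action evaluation to reduce the overestimation bias introduced by the max operator:

    y^{\mathrm{DDQN}} = r + \gamma \, Q\left(s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-}\right)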
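Likewise, the policy-based methods in the survey build on the policy gradient theorem; a standard advantage actor-critic form (again a sketch, not the paper's exact notation) is:

    \nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s) \, A(s, a)\right], \quad A(s, a) = Q(s, a) - V(s)

A3C estimates the advantage A(s, a) with n-step returns and updates the actor and critic from many asynchronous parallel workers, which is the source of its dramatic reduction in training time.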
CLC Number: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]