Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; [2] Jiangsu Provincial Key Laboratory of Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
Source: Chinese Journal of Computers (《计算机学报》), 2023, Issue 10, pp. 2066-2083 (18 pages)
Funding: National Natural Science Foundation of China (62376179, 61772355, 61702055, 61876217, 62176175); Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01A238); Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
Abstract: In recent years, deep reinforcement learning has achieved remarkable results on control tasks, but its limited exploration capability makes it difficult to solve complex tasks quickly and stably. Hierarchical reinforcement learning, an important branch of deep reinforcement learning, mainly targets large-scale problems, but it still suffers from poorly chosen prior knowledge and an inability to balance exploration and exploitation effectively. To address these problems, this paper proposes Maximum Entropy Hierarchical Reinforcement Learning with Advantage-weighted Mutual Information Maximization (HRL-AMIM). The algorithm combines advantage-function-weighted importance sampling with mutual information maximization to resolve the sample-clustering problem induced by the policy, and adds an intrinsic reward to emphasize the diversity of options. This reward is then incorporated into the maximum-entropy reinforcement learning objective, giving the policy stronger exploration and better stability. In addition, an annealing scheme on the number of options not only reduces the influence of prior knowledge on performance but also balances exploration against exploitation, yielding higher sample efficiency and faster learning. Experiments on MuJoCo tasks show that HRL-AMIM offers substantial advantages in performance and stability over both conventional deep reinforcement learning algorithms and hierarchical reinforcement learning algorithms of the same type. Ablation studies and hyperparameter-sensitivity experiments further verify the robustness and effectiveness of the algorithm.

Reinforcement learning is a significant research area in machine learning. By interacting with the environment, agents can adapt to dynamic environments, and this interactive learning approach allows an agent to progressively optimize its policy, which is promising for a wide range of applications. Deep reinforcement learning, which combines reinforcement learning with deep learning, plays a crucial role in artificial intelligence: it enables agents to learn and make autonomous decisions in complex, dynamic environments without elaborate supervised data.

In recent years, deep reinforcement learning has achieved remarkable results in games and complex control tasks. For example, the Deep Q-Network (DQN) algorithm uses a convolutional neural network to process visual input from the game screen and continuously updates its policy through Q-learning. In Atari 2600 games, DQN can learn advanced game strategies autonomously from raw screen pixels, even without human expert guidance. However, DQN applies only to tasks with discrete action spaces. To address this limitation, the Deep Deterministic Policy Gradient (DDPG) algorithm combines deterministic policy gradients with DQN-style value learning to achieve policy optimization in continuous action spaces. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm uses a clipped double Q-network to prevent overestimation of the value function, and introduces delayed policy updates and target policy smoothing to improve the stability and exploration of policy learning. The Soft Actor-Critic (SAC) algorithm achieves efficient learning over continuous action spaces by simultaneously learning a policy network and a value-function network, combined with entropy regularization, and provides a useful learning framework for solving large-scale problems.

However, deep reinforcement learning still struggles to solve complex tasks quickly and stably because of its limited exploration capability. Hierarchical reinforcement learning, an important branch of deep reinforcement learning, mainly targets such large-scale problems.
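For context, the entropy-regularized objective mentioned in the abstract is the standard maximum-entropy reinforcement learning objective popularized by SAC, and a mutual-information term over options is commonly estimated with a variational lower bound. A minimal sketch in standard notation follows; the variational-bound form is an assumption borrowed from option-diversity methods such as DIAYN, and the exact HRL-AMIM objective is defined only in the full paper.

```latex
% Standard maximum-entropy RL objective (SAC form): expected return plus a
% temperature-weighted entropy bonus that encourages exploration.
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]

% A common variational lower bound on the option-state mutual information,
% usable as a diversity-promoting intrinsic reward (assumed form, not
% necessarily the paper's exact estimator):
I(\Omega; S) \;\geq\; \mathbb{E}_{\omega, s}
             \big[ \log q_{\phi}(\omega \mid s) - \log p(\omega) \big]
```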
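Similarly, the abstract names advantage-weighted importance sampling and an annealing schedule on the number of options. The short Python sketch below illustrates one plausible reading of these mechanisms under a uniform option prior; every function name, the exponential advantage weighting, and the linear annealing schedule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def advantage_weighted_is_weights(log_pi_new, log_pi_old, advantages, temperature=1.0):
    """Self-normalized importance weights rho = pi_new / pi_old, further scaled
    by exp(A / temperature) so that samples whose actions beat the baseline
    contribute more to the update (hypothetical form)."""
    rho = np.exp(log_pi_new - log_pi_old)        # importance ratios
    w = rho * np.exp(advantages / temperature)   # advantage weighting
    return w / (w.sum() + 1e-8)                  # normalize to sum to 1

def mi_intrinsic_reward(option, log_q_option_given_state, num_options):
    """DIAYN-style variational bound on I(option; state):
    r_int = log q(option | state) - log p(option), with a uniform option prior."""
    return log_q_option_given_state[option] + np.log(num_options)

def annealed_num_options(step, total_steps, max_options, min_options=2):
    """Linearly anneal the number of active options from max to min, shifting
    the agent from broad exploration toward focused exploitation."""
    frac = min(step / total_steps, 1.0)
    return max(min_options, int(round(max_options - frac * (max_options - min_options))))

# Toy usage with random data.
rng = np.random.default_rng(0)
log_pi_old = rng.normal(-1.0, 0.3, size=64)
log_pi_new = log_pi_old + rng.normal(0.0, 0.1, size=64)
adv = rng.standard_normal(64)
weights = advantage_weighted_is_weights(log_pi_new, log_pi_old, adv)
r_int = mi_intrinsic_reward(option=1,
                            log_q_option_given_state=np.log([0.2, 0.5, 0.3]),
                            num_options=3)
k = annealed_num_options(step=5_000, total_steps=10_000, max_options=8)
```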
Keywords: deep reinforcement learning; hierarchical reinforcement learning; advantage weighting; mutual information; maximum entropy
Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]