Authors: ZHANG Jian-hang; LIU Quan [1,2,3,4]
Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; [2] Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China; [3] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; [4] Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
Source: Computer Science, 2021, No. 10, pp. 37-43 (7 pages)
Funding: National Natural Science Foundation of China (61772355, 61702055, 61502323, 61502329); Major Program of Natural Science Research of Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Application Foundation Research Program, Industrial Part (SYG201422); Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Abstract: Continuous control in reinforcement learning has been a hot research topic in recent years. The deep deterministic policy gradient (DDPG) algorithm performs well in continuous control tasks. DDPG trains its network model with an experience replay mechanism; to further improve the efficiency of experience replay in DDPG, the cumulative episode return is used as the basis for classifying samples, and a deep deterministic policy gradient with episode experience replay (EER-DDPG) algorithm is proposed. First, transitions are stored in units of whole episodes, and two replay buffers are used to store episodes separately according to the size of their cumulative return. Then, during network model training, episodes with large cumulative return are sampled preferentially to improve the quality of training. The method is verified experimentally on continuous control tasks and compared with DDPG using uniform random sampling, trust region policy optimization (TRPO), and proximal policy optimization (PPO). The experimental results show that EER-DDPG achieves better performance.
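The abstract describes storing transitions by whole episode, splitting episodes across two replay buffers according to cumulative return, and sampling preferentially from the high-return buffer during training. As the full paper is not reproduced here, the sketch below is only a minimal illustration of that classify-and-sample idea: the class name EpisodeReplayBuffer, the fixed threshold argument, and the 0.7 sampling bias are illustrative assumptions, not values taken from the paper.

    # Minimal sketch of episode-classified experience replay (assumed details noted above).
    import random
    from collections import deque

    class EpisodeReplayBuffer:
        """Stores whole episodes, split into two pools by cumulative return."""

        def __init__(self, capacity=1000, high_sample_ratio=0.7):
            self.high = deque(maxlen=capacity)   # episodes with large cumulative return
            self.low = deque(maxlen=capacity)    # remaining episodes
            self.high_sample_ratio = high_sample_ratio  # assumed bias toward good episodes

        def add_episode(self, transitions, threshold):
            """transitions: list of (state, action, reward, next_state, done) tuples."""
            episode_return = sum(t[2] for t in transitions)  # cumulative reward of the episode
            pool = self.high if episode_return >= threshold else self.low
            pool.append(transitions)

        def sample(self, batch_size):
            """Draw a minibatch of transitions, biased toward high-return episodes."""
            if not self.high and not self.low:
                raise ValueError("no episodes stored yet")
            batch = []
            for _ in range(batch_size):
                use_high = bool(self.high) and (not self.low or random.random() < self.high_sample_ratio)
                episode = random.choice(self.high if use_high else self.low)
                batch.append(random.choice(episode))
            return batch

    # Usage sketch with hypothetical numbers: classify a finished episode against a
    # chosen return threshold, then sample a small minibatch for a DDPG update.
    buffer = EpisodeReplayBuffer()
    episode = [([0.0], [0.1], 1.0, [0.1], False), ([0.1], [0.2], 2.0, [0.3], True)]
    buffer.add_episode(episode, threshold=1.5)
    minibatch = buffer.sample(batch_size=2)

How the threshold is set (a fixed value, a running mean of past returns, or something else) is a design choice the abstract does not specify; the fixed value above is purely for illustration.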
Keywords: deep deterministic policy gradient; continuous control task; experience replay; cumulative return; classified experience replay
Classification Code: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]