Authors: LI Maojie; XU Guozheng; GAO Xiang[1,2]; TAN Caiming
Affiliations: [1] College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China; [2] Robotics Information Sensing and Control Institute, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China
Source: Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), 2023, No. 1, pp. 96-103 (8 pages)
Funding: Supported by the Natural Science Foundation of Jiangsu Province (BK20210599) and the Natural Science Research Project of Jiangsu Higher Education Institutions (20KJB510023).
Abstract: To address the low sample efficiency and poor generalization that deep reinforcement learning methods commonly exhibit when manipulators learn reaching skills, a skill learning method based on meta-Q learning is proposed. First, deep deterministic policy gradient (DDPG) combined with hindsight experience replay (HER) is used to train a manipulator to reach a target point with a specified attitude, verifying the effectiveness of the algorithm in reaching tasks. Second, a multi-task objective is constructed over a set of related tasks and used as the optimization target; DDPG combined with HER trains the model, yielding a strongly generalizing meta-training model together with meta-training data, while a GRU extracts trajectory context variables. Finally, a small amount of training is performed on the new task, after which the meta-training data are used to further train the model and improve performance. Simulation experiments show that meta-Q learning brings clear improvements in initial performance, learning rate, and convergence performance: the number of samples required to reach the desired performance drops by 77%, and the average success rate increases by 15%.
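The abstract is all this record provides, but two of the steps it names are concrete enough to illustrate. Below is a minimal Python sketch of the HER "future" relabeling that the DDPG training relies on; the function names, the transition-tuple layout, and the sparse reward are assumptions for illustration, not the paper's code.

```python
import numpy as np

def sparse_reward(achieved, goal, tol=0.05):
    # Sparse reaching reward HER is designed for: 0 on success, -1 otherwise.
    return 0.0 if np.linalg.norm(np.asarray(achieved) - np.asarray(goal)) <= tol else -1.0

def her_relabel(episode, reward_fn=sparse_reward, k=4, rng=None):
    """'Future' strategy: for each transition, sample k goals actually achieved
    later in the same episode and store the transition again as if those goals
    had been the intended ones, turning failed episodes into useful data."""
    rng = rng or np.random.default_rng()
    relabeled = []
    for t, (obs, action, next_obs, achieved, goal) in enumerate(episode):
        # Keep the original transition with the true goal.
        relabeled.append((obs, action, next_obs, goal, reward_fn(achieved, goal)))
        # Add k hindsight transitions with goals drawn from later achieved states.
        for i in rng.integers(t, len(episode), size=k):
            new_goal = episode[i][3]  # state actually reached at a later step
            relabeled.append((obs, action, next_obs, new_goal,
                              reward_fn(achieved, new_goal)))
    return relabeled
```

Likewise, a minimal sketch of the GRU trajectory-context encoder the abstract mentions; ContextEncoder and all dimensions are hypothetical. In meta-Q learning the resulting context vector is typically concatenated to the state (and action) input of the policy and Q-function so that one network can adapt across the task set.

```python
import torch
from torch import nn

class ContextEncoder(nn.Module):
    """Encode the recent (obs, action, reward) tuples of a trajectory
    into a fixed-size context variable with a GRU."""
    def __init__(self, transition_dim, hidden_dim=64, context_dim=16):
        super().__init__()
        self.gru = nn.GRU(transition_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, context_dim)

    def forward(self, traj):        # traj: (batch, T, transition_dim)
        _, h = self.gru(traj)       # final hidden state: (1, batch, hidden_dim)
        return self.head(h[-1])     # context variable: (batch, context_dim)
```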
Keywords: robot learning; meta reinforcement learning; deep deterministic policy gradient; meta-Q learning; sample efficiency
Classification Code: TP242.6 [Automation and Computer Technology - Detection Technology and Automatic Equipment]