Authors: LI Maojie; XU Guozheng; GAO Xiang[1,2]; TAN Caiming
Affiliations: [1] College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China; [2] Robotics Information Sensing and Control Institute, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China
Source: Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), 2023, No. 1, pp. 96-103 (8 pages)
Funding: Supported by the Natural Science Foundation of Jiangsu Province (BK20210599) and the Natural Science Research Project of Jiangsu Higher Education Institutions (20KJB510023).
Abstract: To address the low sample efficiency and poor generalization that deep reinforcement learning methods commonly exhibit when manipulators learn reaching skills, a skill learning method based on meta-Q learning is proposed. First, deep deterministic policy gradient (DDPG) combined with hindsight experience replay (HER) is used to train a manipulator to reach a target point with a specified attitude, verifying the effectiveness of the algorithm in reaching tasks. Second, a multi-task objective is constructed over a set of related tasks and used as the optimization target; DDPG combined with HER trains the model, yielding a strongly generalizing meta-training model together with meta-training data, while a GRU extracts trajectory context variables. Finally, a small amount of training is performed on the new task, after which the meta-training data are used to further train the model and improve performance. Simulation experiments show that meta-Q learning brings clear improvements in initial performance, learning rate, and convergence performance: the number of samples required to reach the desired performance drops by 77%, and the average success rate increases by 15%.
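The abstract is all this record provides, but two of the steps it names are concrete enough to illustrate. Below is a minimal Python sketch of the HER "future" relabeling that the DDPG training relies on; the function names, the transition-tuple layout, and the sparse reward are assumptions for illustration, not the paper's code.

```python
import numpy as np

def sparse_reward(achieved, goal, tol=0.05):
    # Sparse reaching reward HER is designed for: 0 on success, -1 otherwise.
    return 0.0 if np.linalg.norm(np.asarray(achieved) - np.asarray(goal)) <= tol else -1.0

def her_relabel(episode, reward_fn=sparse_reward, k=4, rng=None):
    """'Future' strategy: for each transition, sample k goals actually achieved
    later in the same episode and store the transition again as if those goals
    had been the intended ones, turning failed episodes into useful data."""
    rng = rng or np.random.default_rng()
    relabeled = []
    for t, (obs, action, next_obs, achieved, goal) in enumerate(episode):
        # Keep the original transition with the true goal.
        relabeled.append((obs, action, next_obs, goal, reward_fn(achieved, goal)))
        # Add k hindsight transitions with goals drawn from later achieved states.
        for i in rng.integers(t, len(episode), size=k):
            new_goal = episode[i][3]  # state actually reached at a later step
            relabeled.append((obs, action, next_obs, new_goal,
                              reward_fn(achieved, new_goal)))
    return relabeled
```

Likewise, a minimal sketch of the GRU trajectory-context encoder the abstract mentions; ContextEncoder and all dimensions are hypothetical. In meta-Q learning the resulting context vector is typically concatenated to the state (and action) input of the policy and Q-function so that one network can adapt across the task set.

```python
import torch
from torch import nn

class ContextEncoder(nn.Module):
    """Encode the recent (obs, action, reward) tuples of a trajectory
    into a fixed-size context variable with a GRU."""
    def __init__(self, transition_dim, hidden_dim=64, context_dim=16):
        super().__init__()
        self.gru = nn.GRU(transition_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, context_dim)

    def forward(self, traj):        # traj: (batch, T, transition_dim)
        _, h = self.gru(traj)       # final hidden state: (1, batch, hidden_dim)
        return self.head(h[-1])     # context variable: (batch, context_dim)
```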
Keywords: robot learning; meta reinforcement learning; deep deterministic policy gradient; meta-Q learning; sample efficiency
Classification Code: TP242.6 [Automation and Computer Technology - Detection Technology and Automatic Equipment]