Authors: Qiao He [1]; Li Zenghui; Liu Chun; Hu Sidong (School of Electrical & Control Engineering, Liaoning Technology University, Huludao, Liaoning 125105, China)
Affiliation: [1] School of Electrical & Control Engineering, Liaoning Technology University, Huludao, Liaoning 125105, China
Source: Application Research of Computers, 2024, No. 9, pp. 2635-2640 (6 pages)
Fund: National Natural Science Foundation of China (51604141, 51204087).
Abstract: In deep reinforcement learning, the intrinsic curiosity model (ICM) gives an agent in a sparse-reward environment the opportunity to learn previously unknown policies. However, because the curiosity reward is a state-difference value, the agent tends to over-focus on exploring new states and can fall into blind exploration. To address this problem, this paper proposes an intrinsic curiosity model algorithm based on knowledge distillation (KD-ICM). First, the algorithm introduces knowledge distillation so that the agent acquires richer environmental information and policy knowledge in a shorter time, accelerating the learning process. Second, a pre-trained teacher neural network guides the forward network, yielding a forward model with higher accuracy and performance and reducing the agent's blind exploration. Two different simulation experiments were designed on the Unity simulation platform for comparison. The experiments show that in a complex simulation task environment, the average reward of the KD-ICM algorithm is 136% higher than that of ICM and the optimal-action probability is 13.47% higher, improving both the agent's exploration performance and the quality of its exploration, which verifies the feasibility of the algorithm.
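The abstract describes two coupled pieces: an ICM-style curiosity reward computed as a forward-model prediction error, and a knowledge-distillation term in which a pre-trained teacher forward network guides the student forward network. The sketch below is not the authors' code; it is a minimal illustration of that idea under assumed names, dimensions, and loss weights, with PyTorch assumed as the framework.

```python
# Minimal sketch of an ICM-style forward model guided by a pre-trained teacher
# (knowledge distillation), as described in the abstract. All class names,
# hyperparameters (eta, alpha) and network sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ForwardModel(nn.Module):
    """Predicts the next state feature from (state feature, action)."""
    def __init__(self, feat_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, phi_s, action):
        return self.net(torch.cat([phi_s, action], dim=-1))


def curiosity_reward(student, phi_s, action, phi_next, eta=0.01):
    """ICM intrinsic reward: scaled prediction error of the forward model."""
    with torch.no_grad():
        pred = student(phi_s, action)
        return eta * 0.5 * (pred - phi_next).pow(2).sum(dim=-1)


def kd_forward_loss(student, teacher, phi_s, action, phi_next, alpha=0.5):
    """Forward-model loss = prediction error + a distillation term that pulls
    the student's prediction toward the frozen pre-trained teacher's output."""
    pred = student(phi_s, action)
    with torch.no_grad():
        teacher_pred = teacher(phi_s, action)
    pred_loss = F.mse_loss(pred, phi_next)          # usual ICM forward loss
    distill_loss = F.mse_loss(pred, teacher_pred)   # knowledge-distillation term
    return (1 - alpha) * pred_loss + alpha * distill_loss


# Illustrative usage with random tensors standing in for encoded states/actions.
if __name__ == "__main__":
    feat_dim, act_dim = 32, 4
    student = ForwardModel(feat_dim, act_dim)
    teacher = ForwardModel(feat_dim, act_dim)       # would be pre-trained in practice
    phi_s, phi_next = torch.randn(8, feat_dim), torch.randn(8, feat_dim)
    action = torch.randn(8, act_dim)
    r_int = curiosity_reward(student, phi_s, action, phi_next)
    loss = kd_forward_loss(student, teacher, phi_s, action, phi_next)
    print(r_int.shape, loss.item())
```

In an actual pipeline, the intrinsic reward would be added to the environment reward used by the policy optimizer (the paper's keywords indicate proximal policy optimization), and the distillation weight would trade off teacher guidance against fitting the observed transitions.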
Keywords: deep reinforcement learning; knowledge distillation; proximal policy optimization; sparse reward; intrinsic curiosity
Classification: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]