Authors: Song Jiangfan; Li Jinlong[1] (School of Computer Science & Technology, University of Science & Technology of China, Hefei 230000, China)
Affiliation: [1] School of Computer Science and Technology, University of Science and Technology of China, Hefei 230000, China
Source: Application Research of Computers (《计算机应用研究》), 2023, No. 10, pp. 2928-2932, 2944 (6 pages)
Abstract: In reinforcement learning, policy gradient methods often model a continuous-time problem as a discrete-time one through sampling. Modeling the problem more accurately requires a higher sampling frequency, but an excessively high sampling frequency can make the action change too often and thus lower training efficiency. To address this problem, this paper proposes an action stable updating algorithm. The method uses the change in the policy function's output to compute a probability of repeating the current action, and randomly repeats or changes the action according to that probability. The paper analyzes the algorithm's performance theoretically, then evaluates it in nine different environments and compares it with existing methods; it outperforms the existing methods in six of these environments. The experimental results show that the action stable updating algorithm can effectively improve the training efficiency of policy gradient methods in continuous-time problems.
Classification: TP389.1 [Automation and Computer Technology - Computer System Architecture]
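The abstract above describes computing a probability of repeating the previous action from the change in the policy function's output, then stochastically repeating or resampling the action. The following is a minimal Python sketch of that idea; the function names, the exponential mapping from output change to repetition probability, and the sensitivity parameter are illustrative assumptions, since the record does not give the paper's exact formula.

    import numpy as np

    def repeat_probability(prev_output, curr_output, sensitivity=1.0):
        # Map the change in the policy output to a repetition probability:
        # a small change makes repeating the previous action more likely.
        # The exponential form and the sensitivity parameter are assumptions
        # made for illustration, not the formula from the paper.
        delta = np.linalg.norm(np.asarray(curr_output) - np.asarray(prev_output))
        return float(np.exp(-sensitivity * delta))

    def stable_action(prev_action, prev_output, curr_output, sample_action, rng=None):
        # Repeat the previous action with the computed probability,
        # otherwise sample a fresh action from the current policy output.
        rng = rng or np.random.default_rng()
        if prev_action is not None and rng.random() < repeat_probability(prev_output, curr_output):
            return prev_action
        return sample_action(curr_output)

A hypothetical per-step usage would pass the previous and current policy outputs (e.g. action means) and a sampler, such as stable_action(prev_action, prev_mean, curr_mean, lambda mu: rng.normal(mu, 0.1)).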