Action stable updating algorithm for policy gradient methods in continuous time (Cited by: 2)

Authors: Song Jiangfan; Li Jinlong[1] (School of Computer Science & Technology, University of Science and Technology of China, Hefei 230000, China)

Affiliation: [1] School of Computer Science & Technology, University of Science and Technology of China, Hefei 230000, China

Source: Application Research of Computers (《计算机应用研究》), 2023, No. 10, pp. 2928-2932, 2944 (6 pages)

Abstract: In reinforcement learning, policy gradient methods often need to model continuous-time problems as discrete-time problems through sampling. Modeling the problem more accurately requires raising the sampling frequency; however, an excessively high sampling frequency can make actions change too often, which lowers training efficiency. To address this problem, this paper proposes an action stable updating algorithm. The method uses the change in the policy function's output to compute the probability of repeating the action, and randomly repeats or changes the action according to that probability. The paper analyzes the algorithm's performance theoretically, then evaluates it in nine different environments and compares it with existing methods; it outperforms them in six of these environments. The experimental results show that the action stable updating algorithm can effectively improve the training efficiency of policy gradient methods in continuous-time problems.

Keywords: reinforcement learning; continuous time; policy gradient; action repetition

Classification code: TP389.1 [Automation and Computer Technology - Computer System Architecture]
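
The action repetition mechanism summarized in the abstract can be illustrated with a minimal Python sketch. The exponential mapping from the change in the policy output to a repeat probability, the `sensitivity` parameter, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def repeat_probability(prev_output, curr_output, sensitivity=1.0):
    """Map the change in the policy function's output to a probability of
    repeating the previous action: the smaller the change, the higher the
    probability. The exponential form is an assumption for illustration."""
    change = np.linalg.norm(np.asarray(curr_output) - np.asarray(prev_output))
    return float(np.exp(-sensitivity * change))

def select_action(prev_action, prev_output, curr_output, sample_new_action, rng=None):
    """Randomly repeat the previous action or draw a new one, according to
    the repeat probability computed from consecutive policy outputs."""
    rng = rng if rng is not None else np.random.default_rng()
    if prev_action is not None and rng.random() < repeat_probability(prev_output, curr_output):
        return prev_action        # keep the current action stable
    return sample_new_action()    # change the action

# Example: at each high-frequency sampling step, feed the policy's outputs
# at the previous and current steps, plus a sampler for a fresh action.
action = select_action(
    prev_action=np.array([0.2]),
    prev_output=np.array([0.50, 0.10]),
    curr_output=np.array([0.52, 0.11]),   # small change -> likely repeat
    sample_new_action=lambda: np.array([0.9]),
)
```

When the policy output barely changes between consecutive samples, the repeat probability is close to 1, so the agent holds its action across fine-grained time steps instead of switching at every sample, which is the training-efficiency effect the abstract describes.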
