Author Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou 215006, Jiangsu [2] Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000 [3] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012 [4] Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou 215006, Jiangsu
Source: Chinese Journal of Computers (《计算机学报》), 2019, No. 10, pp. 2203-2215 (13 pages)
Funding: National Natural Science Foundation of China (61772355, 61702055, 61502323, 61502329); Major Program of Natural Science Research of Jiangsu Higher Education Institutions (17KJA520004, 18KJA520011); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Applied Basic Research Program, Industry Section (SYG201422); Suzhou Key Industry Technology Innovation - Prospective Application Research Project (SYG201804); Jiangsu Provincial Key Laboratory of Computer Information Processing Technology, Soochow University (KJS1524)
Abstract: In online reinforcement learning, value function approximation is usually performed with stochastic gradient descent (SGD). On each time step, SGD obtains a random sample through the reinforcement learning algorithm (e.g., via the ε-greedy strategy) and computes a local gradient of the loss function, so a single parameter update requires little computation, which makes SGD well suited to online learning. However, because the gradient magnitude of the objective function differs across dimensions, SGD can oscillate during optimization, which increases the number of iterations, slows convergence, or even prevents it. It is also difficult for SGD to choose a suitable learning rate: a rate that is too small slows convergence, while one that is too large hinders it; when different dimensions of the training data have different characteristics and value ranges, each dimension should use its own learning rate.

This paper proposes Adaptive Learning Rate on Integrated Stochastic Gradient Descent (ALRI-SGD), which improves SGD in two ways. (1) On the basis of parameter prediction, historical stochastic gradient information is integrated with the current gradient to form the update gradient of the current time step. This damps oscillation along directions with large gradients and accelerates progress toward an extremum along directions with small gradients; within a single dimension, parameters are updated faster before reaching the extremum, and if a parameter overshoots the extremum so that the gradient reverses direction, the integrated gradient shrinks the subsequent updates and suppresses oscillation. (2) The learning rate of each dimension is computed dynamically from that dimension's historical gradient information. Under certain mathematical constraints, the convergence of ALRI-SGD is proved. ALRI-SGD is combined with off-policy Q-learning based on linear function approximation to solve two classic reinforcement learning problems, Mountain Car and pole balancing (Cart-Pole), and is compared experimentally with SGD-based Q-learning. The results show that ALRI-SGD dynamically matches the gradient differences of the model parameters across dimensions and updates the learning rate automatically to fit the data characteristics of each dimension, improving both convergence efficiency and convergence stability.
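The abstract describes the update rule only in words, so the following Python sketch shows one plausible reading of it, not the paper's exact formulas. The class name, the hyperparameters (base_lr, momentum, eps), and the concrete choices of a Nesterov-style look-ahead for parameter prediction and AdaGrad-style accumulation for the per-dimension learning rates are all illustrative assumptions.

```python
import numpy as np

class ALRISGD:
    """Sketch of an ALRI-SGD-style optimizer (illustrative assumptions).

    The abstract names three ingredients: parameter prediction, integration
    of historical gradients into the current update, and per-dimension
    learning rates derived from historical gradient information. This sketch
    realizes them with Nesterov-style prediction, a momentum term, and
    AdaGrad-style accumulation; the paper's actual formulas may differ.
    """

    def __init__(self, dim, base_lr=0.1, momentum=0.9, eps=1e-8):
        self.theta = np.zeros(dim)    # model parameters, one per feature
        self.v = np.zeros(dim)        # integrated (historical) gradient direction
        self.hist_sq = np.zeros(dim)  # per-dimension accumulated squared gradients
        self.base_lr = base_lr
        self.momentum = momentum
        self.eps = eps

    def predicted_params(self):
        # Parameter prediction: evaluate the gradient at a look-ahead point,
        # as in Nesterov momentum (an assumption about the paper's scheme).
        return self.theta + self.momentum * self.v

    def step(self, grad):
        # (2) Per-dimension learning rate from historical gradient magnitudes:
        # dimensions with larger accumulated gradients take smaller steps.
        self.hist_sq += grad ** 2
        lr = self.base_lr / (np.sqrt(self.hist_sq) + self.eps)
        # (1) Integrate the previous update direction with the current
        # gradient; if a parameter overshoots an extremum, grad reverses
        # sign and the integrated step shrinks, suppressing oscillation.
        self.v = self.momentum * self.v - lr * grad
        self.theta += self.v


# Illustrative use inside off-policy Q-learning with linear function
# approximation, where phi is the feature vector of the current
# state-action pair (names here are hypothetical):
#   theta_la = opt.predicted_params()
#   delta = r + gamma * q_max_next - phi.dot(theta_la)  # TD error
#   opt.step(-delta * phi)  # semi-gradient of the squared TD error
```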
Keywords: reinforcement learning; integrated stochastic gradient descent; adaptive learning rate; parameter prediction; Q-learning
Classification Code: TP18 [Automation and Computer Technology - Control Theory and Control Engineering]