Authors: ZHOU Qin; LUO Fei[1]; DING Wei-chao; GU Chun-hua[1]; ZHENG Shuai (School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)
Affiliation: [1] School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
Source: Computer Science (《计算机科学》), 2022, No. 3, pp. 239-245 (7 pages)
Funding: National Natural Science Foundation of China (61472139); Shanghai Automotive Industry Science and Technology Development Foundation industry-university-research project (1915).
Abstract: Q-Learning is currently a mainstream reinforcement learning algorithm, but it converges slowly in stochastic environments. Previous work addressed the overestimation problem of Speedy Q-Learning and proposed the Double Speedy Q-Learning algorithm. However, Double Speedy Q-Learning does not account for the self-loop structure present in stochastic environments, i.e., when the agent performs an action there is a nonzero probability of re-entering the current state. This hampers the agent's learning in stochastic environments and thereby slows the algorithm's convergence. To address the self-loop structure in Double Speedy Q-Learning, the Bellman operator of the algorithm is improved using successive over-relaxation, yielding Double Speedy Q-Learning based on Successive Over-Relaxation (DSQL-SOR), which further accelerates the convergence of Double Speedy Q-Learning. Numerical experiments compare the error between actual and expected rewards for DSQL-SOR and other algorithms. The results show that the error of the proposed algorithm is 0.6 lower than that of the existing mainstream algorithm SQL and 0.5 lower than that of the successive over-relaxation algorithm GSQL, indicating that DSQL-SOR outperforms the other algorithms. The experiments also test the scalability of DSQL-SOR: as the state space grows from 10 to 1000, the average time per iteration increases slowly, always remaining on the order of 10^(-4), indicating that DSQL-SOR scales well.
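The abstract does not reproduce the exact DSQL-SOR update rule. As an illustrative sketch only, the following shows how a successive over-relaxation (SOR) factor can modify the Bellman optimality operator in a tabular setting, in the spirit of SOR Q-Learning; the toy MDP, the function `sor_bellman`, and the choice w = 1.05 are all hypothetical and not taken from the paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP with self-loops (illustrative only;
# not the experimental setup of the paper).
# P[a, s, s2] = transition probability, R[s, a] = expected reward.
P = np.array([
    [[0.6, 0.4], [0.3, 0.7]],   # action 0: strong self-loops
    [[0.1, 0.9], [0.8, 0.2]],   # action 1
])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
gamma = 0.9

def sor_bellman(Q, w):
    """SOR-modified Bellman optimality operator (sketch).
    w = 1 recovers the standard operator; w > 1 over-relaxes the
    backup by mixing in the current state's own greedy value,
    exploiting the self-loop transition probabilities."""
    V = Q.max(axis=1)                                  # greedy state values
    standard = R + gamma * np.einsum('ast,t->sa', P, V)
    return w * standard + (1.0 - w) * V[:, None]

# SOR Q-value iteration; w must stay below 1 / (1 - gamma * p_min),
# where p_min is the smallest self-loop probability (here ~1.099).
w = 1.05
Q = np.zeros((2, 2))
for _ in range(200):
    Q = sor_bellman(Q, w)
```

A useful property of this operator is that the greedy value max_a Q(s, a) at its fixed point coincides with the standard optimal value function, while the over-relaxation factor w > 1 reduces the effective contraction modulus on self-loop transitions, which is the mechanism behind the faster convergence described above.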
Keywords: reinforcement learning; Q-Learning; Markov decision process; successive over-relaxation iteration method; self-loop structure
Classification: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]