Authors: ZHOU Qin; LUO Fei[1]; DING Wei-chao; GU Chun-hua[1]; ZHENG Shuai (School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)
Affiliation: [1] School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
Source: Computer Science (《计算机科学》), 2022, No. 3, pp. 239-245 (7 pages)
Funding: National Natural Science Foundation of China (61472139); Shanghai Automotive Industry Science and Technology Development Foundation industry-university-research project (1915).
Abstract: Q-Learning is currently a mainstream reinforcement learning algorithm, but it converges slowly in stochastic environments. Previous work addressed the overestimation problem of Speedy Q-Learning and proposed the Double Speedy Q-Learning algorithm. However, Double Speedy Q-Learning does not account for the self-loop structure present in stochastic environments, i.e., when the agent performs an action there is a nonzero probability of re-entering the current state. This hampers the agent's learning in stochastic environments and thereby slows the algorithm's convergence. To address the self-loop structure in Double Speedy Q-Learning, the Bellman operator of the algorithm is improved using successive over-relaxation, yielding Double Speedy Q-Learning based on Successive Over-Relaxation (DSQL-SOR), which further accelerates the convergence of Double Speedy Q-Learning. Numerical experiments compare the error between actual and expected rewards for DSQL-SOR and other algorithms. The results show that the error of the proposed algorithm is 0.6 lower than that of the existing mainstream algorithm SQL and 0.5 lower than that of the successive over-relaxation algorithm GSQL, indicating that DSQL-SOR outperforms the other algorithms. The experiments also test the scalability of DSQL-SOR: as the state space grows from 10 to 1000, the average time per iteration increases slowly, always remaining on the order of 10^(-4), indicating that DSQL-SOR scales well.
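The abstract does not reproduce the exact DSQL-SOR update rule. As an illustrative sketch only, the following shows how a successive over-relaxation (SOR) factor can modify the Bellman optimality operator in a tabular setting, in the spirit of SOR Q-Learning; the toy MDP, the function `sor_bellman`, and the choice w = 1.05 are all hypothetical and not taken from the paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP with self-loops (illustrative only;
# not the experimental setup of the paper).
# P[a, s, s2] = transition probability, R[s, a] = expected reward.
P = np.array([
    [[0.6, 0.4], [0.3, 0.7]],   # action 0: strong self-loops
    [[0.1, 0.9], [0.8, 0.2]],   # action 1
])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
gamma = 0.9

def sor_bellman(Q, w):
    """SOR-modified Bellman optimality operator (sketch).
    w = 1 recovers the standard operator; w > 1 over-relaxes the
    backup by mixing in the current state's own greedy value,
    exploiting the self-loop transition probabilities."""
    V = Q.max(axis=1)                                  # greedy state values
    standard = R + gamma * np.einsum('ast,t->sa', P, V)
    return w * standard + (1.0 - w) * V[:, None]

# SOR Q-value iteration; w must stay below 1 / (1 - gamma * p_min),
# where p_min is the smallest self-loop probability (here ~1.099).
w = 1.05
Q = np.zeros((2, 2))
for _ in range(200):
    Q = sor_bellman(Q, w)
```

A useful property of this operator is that the greedy value max_a Q(s, a) at its fixed point coincides with the standard optimal value function, while the over-relaxation factor w > 1 reduces the effective contraction modulus on self-loop transitions, which is the mechanism behind the faster convergence described above.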
Keywords: reinforcement learning; Q-Learning; Markov decision process; successive over-relaxation iteration method; self-loop structure
Classification: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]