A Novel Probability Distribution Update Strategy for Distributed Deep Q-Networks Based on the Sigmoid Function

Authors: GAO Zhuofan; GUO Wenli [1]

Affiliation: [1] China Aviation Industry Corporation Luoyang Electric and Optical Equipment Research Institute, Luoyang, Henan 471000, China

Source: Computer Science (《计算机科学》), 2024, No. 12, pp. 277-285 (9 pages)

Funding: Aeronautical Science Foundation of China (2023Z015013001, 2022Z015013002).

Abstract: Building on the expected-value deep Q-network, the distributed deep Q-network (Dist-DQN) makes discrete action rewards continuous over an interval and solves the stochastic-reward problem of complex environments by continuously updating the probability distribution over the support interval. The reward-probability distribution update strategy, a core function of any Dist-DQN implementation, significantly affects how efficiently an agent learns in its environment. To address this issue, a new Sig-Dist-DQN probability distribution update strategy is proposed. The strategy accounts for the strength of the correlation between each reward-probability support and the observed reward, raising the update rate of the probability mass on strongly correlated supports while lowering the update rate on weakly correlated supports. Experiments in environments provided by OpenAI Gym show that the exponential and harmonic-sequence update strategies vary considerably from one training run to the next, whereas the training curves of Sig-Dist-DQN are very stable. Compared with the exponential and harmonic-sequence update strategies, an agent using Sig-Dist-DQN achieves markedly faster and more stable convergence of the loss function during learning.
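The sigmoid-weighted update described in the abstract can be made concrete with a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the correlation between a support atom and the observed reward is measured by their distance on the support interval, that a sigmoid maps this distance to a per-atom update rate, and that the update pulls the distribution toward a one-hot target at the nearest atom. The function sig_dist_update and the parameters base_lr and k are hypothetical names introduced here for illustration.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sig_dist_update(probs, supports, reward, base_lr=0.1, k=5.0):
        # Correlation proxy (an assumption of this sketch): atoms close to
        # the observed reward are "strongly correlated" and get a rate near
        # base_lr; distant atoms are "weakly correlated" and get a rate
        # near zero.
        distance = np.abs(supports - reward)
        scale = supports[-1] - supports[0]
        rate = base_lr * sigmoid(k * (1.0 - 2.0 * distance / scale))

        # Target distribution: all mass on the atom nearest the reward.
        target = np.zeros_like(probs)
        target[np.argmin(distance)] = 1.0

        # Move each atom's mass toward the target at its own rate, then
        # renormalize so the masses remain a valid distribution.
        new_probs = probs + rate * (target - probs)
        return new_probs / new_probs.sum()

    # Usage: 51 atoms spanning a reward interval of [-10, 10].
    supports = np.linspace(-10.0, 10.0, 51)
    probs = np.full(51, 1.0 / 51)  # start from a uniform distribution
    probs = sig_dist_update(probs, supports, reward=3.0)

The renormalization step keeps the masses summing to one despite the unequal per-atom rates. Under this framing, the exponential and harmonic-sequence baselines mentioned in the abstract would correspond to different choices of the rate function.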

Keywords: distributed deep Q-network; reward interval continuation; probability distribution update; learning efficiency; training stability

Classification: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
