DPC-DQRL: Offline-to-Online Double Q-Value Reinforcement Learning with Dynamic Behavior Cloning Constraints


Authors: Yan Leiming [1,2]; Liu Jian [1,2]; Zhu Yongxin

Affiliations: [1] School of Computer Science & School of Cyber Science and Engineering, Nanjing University of Information Science & Technology, Nanjing 210044, China; [2] Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science & Technology, Nanjing 210044, China

Source: Application Research of Computers, 2025, No. 4, pp. 1003-1010 (8 pages)

Funding: National Natural Science Foundation of China (62172292, 42375147).

Abstract: Offline-to-online reinforcement learning aims to improve the performance of a pre-trained model with a small amount of online fine-tuning. Existing methods fall into two groups: unconstrained and constrained fine-tuning. The former often suffers severe policy collapse caused by large distribution shift, while the latter keeps the offline constraint in place and therefore improves slowly, reducing training efficiency. To address these problems, this work visually compared the fine-tuning processes of the two approaches, identified inaccurate Q-value estimation as the main factor limiting performance, and proposed a dynamic policy-constrained double Q-value reinforcement learning algorithm (DPC-DQRL). First, the algorithm designs a dynamic behavior cloning constraint that follows a memory-forgetting rule, adjusting the constraint strength dynamically during fine-tuning. Second, it builds an offline-online double Q-value network that brings the offline action-value network into Q-value estimation, improving the accuracy of Q-values during fine-tuning. On the Gym simulation platform with the MuJoCo physics engine, DPC-DQRL was evaluated on three classic tasks: HalfCheetah, Hopper, and Walker2D. After fine-tuning, performance improved by 47%, 63%, and 20%, respectively, over the original pre-trained models, and the average normalized score across all tasks exceeded the best baseline algorithm by 10%. The experimental results show that DPC-DQRL improves model performance while keeping training stable, and compares favorably with other algorithms.
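As a rough illustration of the two ideas summarized in the abstract, the NumPy sketch below assumes an exponential, forgetting-curve-style decay for the dynamic behavior cloning weight and a simple convex blend of a frozen offline critic with the online critic. The function names (bc_constraint_weight, blended_q_estimate, actor_loss), the TD3+BC-style actor objective, and all hyperparameters are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def bc_constraint_weight(step, alpha0=1.0, tau=50_000):
        # Hypothetical forgetting-curve schedule: the behavior cloning
        # constraint starts strong and fades as online fine-tuning proceeds.
        return alpha0 * np.exp(-step / tau)

    def blended_q_estimate(q_online, q_offline, step, tau=50_000):
        # Hypothetical mixing rule for an offline-online double Q estimate:
        # early in fine-tuning the frozen offline critic carries more weight,
        # and its influence decays on the same forgetting schedule.
        w = np.exp(-step / tau)
        return w * q_offline + (1.0 - w) * q_online

    def actor_loss(q_value, pi_action, dataset_action, step):
        # TD3+BC-style objective with a time-varying constraint weight.
        bc_term = np.mean((pi_action - dataset_action) ** 2)
        return -np.mean(q_value) + bc_constraint_weight(step) * bc_term

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        q_on = rng.normal(size=256)       # online critic values for a batch
        q_off = rng.normal(size=256)      # frozen offline critic values
        a_pi = rng.normal(size=(256, 6))  # actions from the current policy
        a_data = rng.normal(size=(256, 6))  # actions from the offline dataset
        for step in (0, 25_000, 100_000):
            q = blended_q_estimate(q_on, q_off, step)
            print(step, round(bc_constraint_weight(step), 3),
                  round(float(actor_loss(q, a_pi, a_data, step)), 3))

Running the toy loop shows the constraint weight and the offline critic's share both shrinking as the step counter grows, which is the qualitative behavior the abstract describes: tight imitation of the offline policy early on, and a gradually freer, more accurate online Q-learning signal later.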

Keywords: deep reinforcement learning; offline-to-online reinforcement learning; dynamic policy constraint; Q-value estimation

CLC Number: TP301.6 (Automation and Computer Technology: Computer System Architecture)

 
