Authors: Yan Leiming [1,2]; Liu Jian [1,2]; Zhu Yongxin
Affiliations: [1] School of Computer Science & School of Cyber Science and Engineering, Nanjing University of Information Science & Technology, Nanjing 210044, China; [2] Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science & Technology, Nanjing 210044, China
Source: Application Research of Computers (《计算机应用研究》), 2025, No. 4, pp. 1003-1010 (8 pages)
Fund: Supported by the National Natural Science Foundation of China (62172292, 42375147).
Abstract: Offline-to-online reinforcement learning aims to improve the performance of a pre-trained model with a small amount of online fine-tuning. Existing methods fall into unconstrained and constrained fine-tuning. The former often suffers severe policy collapse caused by large distribution shift, while the latter retains the offline constraint, which slows performance improvement and reduces training efficiency. To address these problems, this work visually compared the fine-tuning processes of the two approaches, identified inaccurate Q-value estimation as the main factor limiting performance, and proposed a dynamic policy-constrained double Q-value reinforcement learning algorithm (DPC-DQRL). First, the algorithm designs a dynamic behavior cloning constraint that follows the law of memory forgetting, dynamically adjusting the constraint strength during fine-tuning. Second, it constructs an offline-online double Q-value network that introduces the offline action-value network into Q-value estimation, improving the accuracy of Q-values during fine-tuning. On the Gym simulation platform with the MuJoCo physics engine, DPC-DQRL was used to fine-tune models on three classic tasks: Halfcheetah, Hopper, and Walker2D. Performance after fine-tuning improved by 47%, 63%, and 20% over the original pre-trained models, respectively, and the average normalized score across all tasks improved by 10% over the best baseline algorithm. The experimental results show that DPC-DQRL improves model performance while keeping training stable, and has certain advantages over other algorithms.
Keywords: deep reinforcement learning; offline-to-online reinforcement learning; dynamic policy constraint; Q-value estimation
Classification: TP301.6 [Automation and Computer Technology: Computer System Architecture]
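To make the abstract's two main ideas concrete, the following is a minimal PyTorch sketch (not the authors' code) of (a) a behavior cloning weight that decays over online fine-tuning steps, loosely following an exponential forgetting curve, and (b) a Q-value estimate that blends two online critics with a frozen offline critic. The function names (bc_weight, blended_q, actor_loss), the decay constants, and the blend weight lam are illustrative assumptions, not details taken from the paper.

```python
import math

import torch
import torch.nn as nn


def bc_weight(step: int, alpha0: float = 2.5, tau: float = 20_000.0) -> float:
    """Dynamic behavior-cloning coefficient: starts at alpha0 and decays
    exponentially with the number of online fine-tuning steps.
    The exponential ("forgetting curve") form and the constants are assumptions."""
    return alpha0 * math.exp(-step / tau)


class QNet(nn.Module):
    """Simple state-action value network, usable for both offline and online critics."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))


def blended_q(q1: QNet, q2: QNet, q_offline: QNet,
              obs: torch.Tensor, act: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Offline-online double Q estimate: the pessimistic minimum of the two online
    critics is blended with a frozen offline critic (blend weight lam is an assumption)."""
    q_online = torch.min(q1(obs, act), q2(obs, act))
    with torch.no_grad():
        q_off = q_offline(obs, act)
    return lam * q_online + (1.0 - lam) * q_off


def actor_loss(actor: nn.Module, q1: QNet, obs: torch.Tensor,
               act_dataset: torch.Tensor, step: int) -> torch.Tensor:
    """TD3+BC-style objective: maximize Q while staying close to dataset actions,
    with the behavior-cloning term weighted by the decaying schedule above."""
    pi = actor(obs)
    q = q1(obs, pi)
    bc = ((pi - act_dataset) ** 2).mean()
    # Normalize the Q term so the RL and BC losses share a common scale (as in TD3+BC).
    return -(q / q.abs().mean().detach()).mean() + bc_weight(step) * bc
```

As the decay in bc_weight drives the behavior cloning term toward zero, the objective gradually shifts from imitating the offline dataset to pure value maximization, which mirrors the abstract's description of relaxing the constraint as online fine-tuning progresses.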