Authors: Zhang Jing[1], Wang Ziming, Ren Yonggong[1]
Affiliation: [1] School of Computer Science and Artificial Intelligence, Liaoning Normal University, Dalian, Liaoning 116081
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2023, No. 6, pp. 1373-1384 (12 pages)
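Funding: National Natural Science Foundation of China (61902165, 61976109); Dalian Science and Technology Innovation Fund (2018J12GX047); Dalian Key Laboratory Special Fund.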
Abstract: Asynchronous advantage actor-critic (A3C) builds an asynchronous parallel deep reinforcement learning framework with one Learner (the main agent) and multiple Workers (sub-agents). Its search for an optimal policy yields high-variance solutions, so the Learner cannot be guaranteed to receive globally optimal parameter updates or to learn the best policy. In addition, a large-scale parallel network built with massive computing resources is difficult to deploy on low-power end platforms. To address these problems, we propose a compact asynchronous advantage actor-critic model (Compact_A3C) that performs model compression and knowledge extraction. The model freezes the Workers of a pre-trained A3C, evaluates their performance in a common state, and maps the evaluation results to probabilities with softmax. The Learner is updated according to these probabilities, which helps it obtain the globally optimal policy and improves resource utilization in the large-scale network. The optimized Learner then serves as the Teacher Network, which supervises a small-scale Student Network and guides its policy during the early exploration stage. A linearly decaying loss function reduces the Teacher Network's guidance over time, encouraging the Student Network to explore complex environments freely and strengthening its ability to learn on its own; this achieves knowledge extraction and network compression for the large-scale A3C model. Student Networks built with different compression ratios match the performance of the large-scale Teacher Network in the popular Gym Classic Control and Atari 2600 environments. The code is available at https://github.com/meadewaking/Compact_A3C.
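Keywords: reinforcement learning; deep reinforcement learning; actor-critic model; asynchronous advantage actor-critic model; model compression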
Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
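
The abstract names two mechanisms: softmax-mapped Worker evaluations that decide how the Learner is updated, and a linearly decaying guidance term in the Student Network's loss. The sketch below illustrates both ideas in plain Python. It is not the authors' implementation (see the repository linked in the abstract for that). The function names, the example scores, the choice to sample a single Worker, and the additive form of the loss are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def softmax(scores):
    """Map raw evaluation scores to probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_worker(scores, rng):
    """Sample the index of the Worker used to update the Learner.

    Assumption: one Worker is sampled per update, with probability
    given by the softmax of its evaluation score.
    """
    probs = softmax(scores)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


def guidance_weight(step, decay_steps):
    """Linearly decay the Teacher's guidance from 1 to 0 over decay_steps."""
    return max(0.0, 1.0 - step / decay_steps)


def student_loss(rl_loss, imitation_loss, step, decay_steps):
    """Student's own RL loss plus a decaying imitation term.

    Assumption: the two terms are simply added; the paper's exact
    loss may differ.
    """
    return rl_loss + guidance_weight(step, decay_steps) * imitation_loss


# Hypothetical evaluation returns for three frozen Workers.
worker_scores = [120.0, 95.0, 130.0]
probs = softmax(worker_scores)
chosen = pick_worker(worker_scores, random.Random(0))
```

With this weighting, the Teacher's influence is strongest at step 0 and disappears once `step` reaches `decay_steps`, after which the Student trains on its own reinforcement learning loss alone.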