A3C Deep Reinforcement Learning Model Compression and Knowledge Extraction  (Cited by: 2)


Authors: Zhang Jing [1], Wang Ziming, Ren Yonggong [1] (School of Computer Science and Artificial Intelligence, Liaoning Normal University, Dalian, Liaoning 116081)

Affiliation: [1] School of Computer Science and Artificial Intelligence, Liaoning Normal University, Dalian, Liaoning 116081

Source: Journal of Computer Research and Development, 2023, No. 6, pp. 1373-1384 (12 pages)

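Funding: National Natural Science Foundation of China (61902165, 61976109); Dalian Science and Technology Innovation Fund (2018J12GX047); Dalian Key Laboratory Special Project.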

Abstract: Asynchronous advantage actor-critic (A3C) builds an asynchronous parallel deep reinforcement learning framework composed of one Learner (the main agent) and multiple Workers (sub-agents). However, A3C yields high-variance solutions during policy exploration, so the Learner cannot be guaranteed to receive globally optimal parameter updates or to learn the best policy. Moreover, a large-scale parallel network built with massive computing resources is difficult to deploy on low-power end platforms. To address these problems, we propose Compact_A3C, a model compression and knowledge extraction model based on supervised exploration. The model freezes the Workers of a pre-trained A3C, evaluates their performance in a common state, and maps the results to probabilities with softmax. The Learner is then updated according to these probabilities, which helps it obtain the globally optimal policy and improves the resource utilization of the large-scale network. Furthermore, the optimized Learner serves as the Teacher Network, which supervises and guides a small-scale Student Network during its early exploration. A linearly decaying loss function gradually reduces the Teacher Network's guidance, encouraging the Student Network to explore complex environments freely and strengthening its ability to learn on its own. We build Student Networks with different compression ratios, and in the popular Gym Classic Control and Atari 2600 environments they match the learning performance of the large-scale Teacher Network. The code is available at https://github.com/meadewaking/Compact_A3C.

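Keywords: reinforcement learning; deep reinforcement learning; actor-critic model; asynchronous advantage actor-critic model; model compression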

CLC number: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
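
The abstract describes two mechanisms: mapping the frozen Workers' evaluation scores to Learner-update probabilities with softmax, and a linearly decaying term that reduces the Teacher Network's guidance of the Student Network. The following is a minimal sketch of both ideas, not the authors' implementation. The function names, the `temperature` parameter, the KL-divergence form of the guidance term, and the `decay_steps` schedule are all assumptions made for illustration; the record itself only states that softmax and a linear decay factor are used.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Map per-Worker evaluation scores to update probabilities.

    `temperature` is an illustrative assumption, not taken from the paper.
    """
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_worker(scores, rng):
    """Sample the index of the Worker used to update the Learner."""
    probs = softmax(scores)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


def guidance_weight(step, decay_steps):
    """Linearly decay the Teacher's guidance from 1 to 0 over `decay_steps`."""
    return max(0.0, 1.0 - step / decay_steps)


def kl_divergence(teacher_probs, student_probs, eps=1e-8):
    """KL(teacher || student) over action probabilities (assumed guidance term)."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )


def student_loss(rl_loss, teacher_probs, student_probs, step, decay_steps):
    """Student's own RL loss plus a linearly decaying Teacher-guidance term."""
    weight = guidance_weight(step, decay_steps)
    return rl_loss + weight * kl_divergence(teacher_probs, student_probs)


if __name__ == "__main__":
    rng = random.Random(0)
    worker_scores = [120.0, 195.0, 180.0]  # hypothetical episode returns
    print("update probabilities:", softmax(worker_scores, temperature=50.0))
    print("sampled Worker index:", pick_worker(worker_scores, rng))
    print("guidance weight at step 250 of 1000:", guidance_weight(250, 1000))
```

Under these assumptions, a Worker with a higher evaluation score is more likely to drive the Learner update. Once `step` reaches `decay_steps`, the guidance term vanishes and the Student Network is trained on its own reinforcement learning loss alone.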

 
