Predictive resource allocation: unsupervised learning of Markov decision processes


Authors: Jiajun WU; Jianyu ZHAO; Chengjian SUN; Chenyang YANG (School of Electronics and Information Engineering, Beihang University, Beijing 100191, China)


Source: Scientia Sinica Informationis (《中国科学:信息科学》), 2024, No. 8, pp. 1983-2000 (18 pages)

Funding: National Key R&D Program of China (Grant No. 2022YFB2902002); National Natural Science Foundation of China, Key Project (Grant No. 61731002); National Natural Science Foundation of China, General Project (Grant No. 62271024).

Abstract: When future information of a mobile user, such as its trajectory, is known, predictive resource allocation for video-on-demand service can reduce the energy consumption of base stations or increase network throughput while ensuring user experience. Traditional methods for predictive resource allocation first predict user information (say, trajectory) and then optimize resource (say, power) allocation. However, the prediction accuracy degrades as the prediction horizon increases, which reduces the gain brought by prediction. To deal with this issue, several recent works have employed deep reinforcement learning for online decision-making by formulating the predictive resource allocation problem as a Markov decision process (MDP). However, for this kind of MDP, which is well suited to reinforcement learning, existing works design the state in a trial-and-error manner. Moreover, for constrained optimization problems, most existing reinforcement learning methods for wireless problems satisfy the constraints by adding penalty terms with manually tuned hyper-parameters to the reward function. Taking as an example the problem of minimizing the transmit energy of base stations under the constraint that video playback of mobile users never stalls, this paper proposes an unsupervised deep learning method for online predictive resource allocation in an end-to-end manner, which jointly optimizes information prediction and resource allocation, and establishes the relationship between this method and deep reinforcement learning. The proposed method improves the performance of predictive resource allocation via online end-to-end unsupervised deep learning, designs the state systematically rather than by trial and error, and satisfies complex constraints automatically rather than by introducing hyper-parameters. Simulation results show that the proposed online unsupervised deep learning achieves transmit energy consumption close to that of deep reinforcement learning while simplifying the state design, which verifies the theoretical analysis.
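The energy-saving mechanism behind predictive resource allocation can be illustrated with a minimal numerical sketch (a toy model, not the authors' algorithm): when channel gains over a prediction window are known, the base station can pre-deliver video bits in good-channel slots so that playback never stalls, and the convex power-rate relation then makes the total transmit energy lower than with myopic, just-in-time delivery. All numbers, gains, and helper names below are hypothetical.

```python
def slot_energy(bits, gain, bandwidth=1e6, slot=1.0):
    # Invert the Shannon rate bits/slot = B*log2(1 + p*g) to get the
    # transmit power p needed to deliver `bits` in one slot, then
    # return the corresponding energy p*slot.
    rate = bits / slot
    power = (2 ** (rate / bandwidth) - 1) / gain
    return power * slot

def stalls(bits_per_slot, demand_per_slot):
    # Playback stalls if cumulative delivered bits ever fall below
    # cumulative playback demand.
    delivered = demanded = 0.0
    for b, d in zip(bits_per_slot, demand_per_slot):
        delivered += b
        demanded += d
        if delivered + 1e-9 < demanded:
            return True
    return False

gains = [0.5, 2.0, 0.5, 2.0]            # hypothetical predicted channel gains
demand = [1e6, 1e6, 1e6, 1e6]           # 1 Mbit of video played per slot

myopic = [1e6, 1e6, 1e6, 1e6]           # transmit exactly what is played
predictive = [1e6, 1.8e6, 0.2e6, 1e6]   # shift bits into the good-channel slot

for name, plan in [("myopic", myopic), ("predictive", predictive)]:
    energy = sum(slot_energy(b, g) for b, g in zip(plan, gains))
    print(name, "stalls" if stalls(plan, demand) else "no stall", round(energy, 3))
```

Both plans satisfy the no-stall constraint, but the predictive plan spends less total energy because it moves bits out of the bad-channel slot 3 and into the good-channel slot 2; the paper's contribution is learning such decisions online, end to end, without hand-designed states or penalty hyper-parameters.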

Keywords: predictive resource allocation; Markov decision process; unsupervised deep learning; deep reinforcement learning; state design; complex constraints

CLC numbers: TP18; O211.62; TN929.5

 
