Optimal scheduling of multi-reservoir system based on n-step Q-learning under discrete four-reservoir problem benchmark  (Cited by: 4)

Authors: HU Hexuan [1,2,3]; QIAN Zeyu; HU Qiang; ZHANG Ye [1,2]

Affiliations: [1] Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing 210098, China; [2] College of Computer and Information, Hohai University, Nanjing 210098, China; [3] School of Electrical Engineering, Tibet Agriculture & Animal Husbandry University, Nyingchi 860000, China

Source: Journal of China Institute of Water Resources and Hydropower Research, 2023, No. 2, pp. 138-147 (10 pages)

Funding: National Key R&D Program of China (2018YFC0407904).

Abstract: The optimal operation of reservoirs is an optimization problem with the Markov property. Reinforcement learning is a current research focus for solving Markov decision process problems and performs well on single-reservoir optimal operation, but the complexity of multi-reservoir systems makes its application difficult. For the complex multi-reservoir optimal scheduling problem, an n-step Q-learning based scheduling method is proposed under the discrete four-reservoir problem benchmark. Building on the n-step Q-learning algorithm, a reinforcement learning model of multi-reservoir optimal scheduling is constructed for the discrete four-reservoir benchmark, and the optimal scheduling scheme is generated by optimizing over the accumulated exploration experience. Experimental results show that, given sufficient exploration experience for learning, one-step Q-learning combined with a penalty function can reach the theoretical optimum. Replacing the penalty function with the feasible direction method to enforce the constraints, and building a per-period feasible-state table and a hash table of feasible actions for each state from the benchmark constraints, effectively reduces the dimensionality of the state-action space and greatly shortens the optimization time. Different exploration strategies determine the effectiveness of the exploration experience and hence the optimization efficiency, which matters especially for complex multi-reservoir problems; an improved ε-greedy strategy is therefore proposed and compared with the traditional ε-greedy, Upper Confidence Bound (UCB), and Boltzmann exploration strategies to verify its effectiveness. On this basis, n-step returns are introduced to extend the method to n-step Q-learning, and suitable hyperparameters such as the number of steps n and the learning rate are determined, further improving the optimization efficiency of the algorithm.
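
The abstract sketches the algorithmic ingredients: a table of feasible actions per period and storage state derived from the constraints (in place of a penalty function), ε-greedy exploration, and n-step Q-learning backups. Below is a minimal, hypothetical Python sketch of how these pieces could fit together on a toy single-reservoir horizon. It is not the paper's discrete four-reservoir benchmark or its improved ε-greedy strategy; every environment detail (number of periods, storage levels, inflow, benefit function) is an assumption made only for illustration.

```python
# Minimal sketch: tabular n-step Q-learning with epsilon-greedy exploration and a
# per-period feasible-action table. The toy single-reservoir environment below is
# an assumed stand-in, NOT the paper's discrete four-reservoir benchmark.
from collections import defaultdict
import random

T = 12                      # number of scheduling periods (assumed)
LEVELS = range(0, 11)       # discretized storage levels (assumed)
INFLOW = 2                  # constant inflow per period (assumed)
MAX_RELEASE = 4             # release bound per period (assumed)

# Feasible-action table: for each (period, storage) keep only releases that respect
# the storage bounds, so infeasible state-action pairs are never visited
# (replacing a penalty-function treatment of the constraints).
feasible = {
    (t, s): [a for a in range(MAX_RELEASE + 1)
             if min(LEVELS) <= s + INFLOW - a <= max(LEVELS)]
    for t in range(T) for s in LEVELS
}

def reward(s, a):
    # Hypothetical period benefit: grows with storage head and release.
    return a * (1.0 + 0.1 * s)

def step(t, s, a):
    return t + 1, s + INFLOW - a

def eps_greedy(Q, t, s, eps):
    acts = feasible[(t, s)]
    if random.random() < eps:
        return random.choice(acts)
    return max(acts, key=lambda a: Q[(t, s, a)])

def train(n=3, episodes=5000, alpha=0.1, gamma=1.0, eps=0.2):
    """Tabular n-step Q-learning over the toy horizon."""
    Q = defaultdict(float)
    for _ in range(episodes):
        t, s = 0, 5                      # start period and initial storage (assumed)
        traj = []                        # (t, s, a, r) tuples for n-step returns
        while t < T:
            a = eps_greedy(Q, t, s, eps)
            t2, s2 = step(t, s, a)
            traj.append((t, s, a, reward(s, a)))
            # Once n transitions are buffered, back up the oldest of them with an
            # n-step return bootstrapped from the greedy value at (t2, s2).
            if len(traj) >= n:
                G = sum(gamma**i * r for i, (_, _, _, r) in enumerate(traj[-n:]))
                if t2 < T:
                    G += gamma**n * max(Q[(t2, s2, b)] for b in feasible[(t2, s2)])
                t0, s0, a0, _ = traj[-n]
                Q[(t0, s0, a0)] += alpha * (G - Q[(t0, s0, a0)])
            t, s = t2, s2
        # Flush the last n-1 transitions with shorter, non-bootstrapped returns.
        for k in range(max(0, len(traj) - n + 1), len(traj)):
            G = sum(gamma**i * r for i, (_, _, _, r) in enumerate(traj[k:]))
            tk, sk, ak, _ = traj[k]
            Q[(tk, sk, ak)] += alpha * (G - Q[(tk, sk, ak)])
    return Q

if __name__ == "__main__":
    Q = train()
    # Greedy rollout of the learned policy.
    t, s, total = 0, 5, 0.0
    while t < T:
        a = max(feasible[(t, s)], key=lambda b: Q[(t, s, b)])
        total += reward(s, a)
        t, s = step(t, s, a)
    print("greedy-policy benefit:", round(total, 2))
```

Restricting both the random draw and the greedy argmax to the per-period feasible-action table keeps every visited state-action pair inside the constraints, which is the kind of state-action space reduction the abstract credits for the shorter optimization time.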

Keywords: optimal reservoir operation; reinforcement learning; Q-learning; penalty function; feasible direction method

CLC number: TV697.1 [Hydraulic Engineering - Water Resources and Hydropower Engineering]

 
