Extensive game decision based on the PPO-CFR algorithm under incomplete information (非完全信息下基于PPO-CFR的扩展式博弈决策)


Authors: Lei HUANG (黄蕾), Jin ZHU (朱进), Fuqing DUAN (段福庆)[2] (College of Information Science and Technology, University of Science and Technology of China, Hefei 230022, China; College of Artificial Intelligence, Beijing Normal University, Beijing 100875, China)

Affiliations: [1] College of Information Science and Technology, University of Science and Technology of China, Hefei 230022, China; [2] College of Artificial Intelligence, Beijing Normal University, Beijing 100875, China

Source: Scientia Sinica Informationis (《中国科学:信息科学》), 2022, Issue 12, pp. 2178-2194 (17 pages)

Funding: National Key Research and Development Program of China (Grant No. 2018AAA0100802); Natural Science Foundation of Anhui Province (Grant No. 2008085MF198).

Abstract: Human-computer gaming under incomplete information is usually described by a two-player zero-sum game model, and counterfactual regret minimization (CFR) is a popular algorithm for two-player zero-sum games with incomplete information. However, the existing CFR algorithm and its variants use a fixed regret-calculation and strategy-update type throughout the iteration process; each choice has its own strengths and weaknesses across incomplete-information extensive-form games, so their generalization performance is weak. To address this problem, this paper combines the proximal policy optimization (PPO) algorithm from reinforcement learning with the CFR algorithm and proposes a PPO-CFR algorithm: a rational agent is trained to adaptively select the regret-calculation and strategy-update type at each CFR iteration, thereby improving generalization performance and optimizing policies for incomplete-information extensive-form games. The proposed algorithm is verified on standard poker-game experiments, and a stepwise reward function is designed to train the agent's action policy. Experimental results show that, compared with existing methods, the PPO-CFR algorithm achieves better generalization performance and lower exploitability, and its iterated policy is closer to a Nash equilibrium policy.

Keywords: incomplete information; extensive-form game; counterfactual regret minimization; proximal policy optimization; game decision-making

Classification: O225 (Science: Operations Research and Cybernetics)
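To make the idea in the abstract concrete, the following is a minimal, hypothetical Python sketch, not taken from the paper: it shows regret matching, two fixed regret-update rules of the kind used by common CFR variants (classic CFR and a CFR+-style clipped update), and a stand-in for the PPO agent that the paper trains to choose the update type adaptively at each iteration. All function names, the toy three-action information set, and the random placeholder regrets are assumptions made for illustration only.

import numpy as np

def regret_matching(cum_regrets):
    # Current strategy from cumulative regrets: play actions in proportion
    # to their positive regret, uniform if no regret is positive.
    positive = np.maximum(cum_regrets, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    return np.full(len(cum_regrets), 1.0 / len(cum_regrets))

def accumulate_regrets(cum_regrets, instant_regrets, update_type):
    # Two fixed update rules used by common CFR variants:
    # "vanilla" keeps the signed running sum (classic CFR),
    # "plus" clips the running sum at zero (CFR+-style).
    if update_type == "plus":
        return np.maximum(cum_regrets + instant_regrets, 0.0)
    return cum_regrets + instant_regrets

def ppo_agent_choice(observation):
    # Stand-in for the PPO policy that, per the abstract, selects the
    # regret-calculation / strategy-update type each iteration.
    # Here it is a trivial placeholder, not a trained network.
    return "plus" if observation["iteration"] % 2 == 0 else "vanilla"

# Toy loop over a single information set with 3 actions.  The
# "instantaneous regrets" are random stand-ins for the counterfactual
# regrets that a real CFR traversal would compute from the game tree.
rng = np.random.default_rng(0)
cum_regrets = np.zeros(3)
strategy_sum = np.zeros(3)
for t in range(1, 1001):
    strategy = regret_matching(cum_regrets)
    strategy_sum += strategy
    instant_regrets = rng.normal(size=3)  # placeholder values only
    update_type = ppo_agent_choice({"iteration": t})
    cum_regrets = accumulate_regrets(cum_regrets, instant_regrets, update_type)

average_strategy = strategy_sum / strategy_sum.sum()  # the quantity CFR converges on
print("average strategy:", average_strategy)

In the actual algorithm, the instantaneous regrets would come from counterfactual-value computations over the poker game tree, and the selection policy would be a PPO-trained network optimized with the paper's stepwise reward function rather than the placeholder rule above.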

 
