A population diversity-based robust policy generation method in adversarial game environments

Authors: ZHUANG Shu-xin (庄述鑫); CHEN Yong-hong (陈永红); HAO Yi-hang (郝一行); WU Wei-wei (吴巍炜); XU Xue-yong (徐学永); WANG Wan-yuan (王万元) (School of Computer Science and Engineering, Southeast University, Nanjing 211189; Shenyang Aeroengine Design and Research Institute, Yangzhou Collaborative Innovation Research Institute Co., Ltd., Yangzhou 210016; Nanjing North Information Industrialization Group Co., Ltd., Nanjing 211189, China)

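Affiliations: [1] School of Computer Science and Engineering, Southeast University, Nanjing, Jiangsu 211189 [2] Yangzhou Collaborative Innovation Research Institute Co., Ltd., Shenyang Aircraft Design and Research Institute, Yangzhou, Jiangsu 210016 [3] North Information Control Research Academy Group Co., Ltd., Nanjing, Jiangsu 211189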

Source: Computer Engineering & Science, 2024, No. 6, pp. 1081-1091 (11 pages)

Abstract: In adversarial game environments, the target agent aims to generate highly robust game policies that consistently yield high returns against different opponent policies. Existing self-play-based policy generation methods often overfit to one specific opponent policy, so the learned policies have low robustness and are vulnerable to attacks from other opponent policies. In addition, existing methods that combine deep reinforcement learning with game theory to iteratively generate opponent policies converge slowly in complex adversarial scenarios with large decision spaces. To address these problems, a population diversity-based robust policy generation method is proposed. In this method, each of the two opposing sides maintains its own population pool of policies, and the policies within each population are required to be diverse, so that a robust target policy can be generated. To ensure population diversity, policy diversity is measured from two perspectives: behavior and quality. Behavioral diversity refers to the differences between the state-action trajectories of different policies, while quality diversity refers to the differences between the returns that different policies obtain against the same opponent. Finally, the robustness of the policies generated with population diversity is validated in typical adversarial environments with continuous state and action spaces.
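The two diversity perspectives named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual formulas (the keywords suggest it relies on behavior representations and the Shapley value), so the pairwise-distance measures below, and the names `behavioral_distance`, `behavioral_diversity`, and `quality_diversity`, are assumptions chosen only to make the definitions concrete. Each trajectory is assumed to be a list of equal-length numeric tuples, one per time step, holding the state concatenated with the action.

```python
import math
from itertools import combinations


def behavioral_distance(traj_a, traj_b):
    """Mean Euclidean distance between time-aligned (state, action) vectors.

    Hypothetical measure: trajectories are compared step by step over
    their common length.
    """
    steps = min(len(traj_a), len(traj_b))
    if steps == 0:
        return 0.0
    return sum(math.dist(traj_a[t], traj_b[t]) for t in range(steps)) / steps


def behavioral_diversity(trajectories):
    """Average pairwise trajectory distance over the whole population."""
    pairs = list(combinations(trajectories, 2))
    if not pairs:
        return 0.0
    return sum(behavioral_distance(a, b) for a, b in pairs) / len(pairs)


def quality_diversity(returns):
    """Average pairwise absolute gap between returns.

    All returns are assumed to be measured against the same opponent.
    """
    pairs = list(combinations(returns, 2))
    if not pairs:
        return 0.0
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Three policies; the first two behave identically, the third differs.
    population_trajectories = [
        [(0.0, 0.0), (1.0, 0.0)],
        [(0.0, 0.0), (1.0, 0.0)],
        [(3.0, 4.0), (1.0, 0.0)],
    ]
    returns_vs_same_opponent = [1.0, 1.0, 4.0]
    print(behavioral_diversity(population_trajectories))
    print(quality_diversity(returns_vs_same_opponent))
```

Under these assumed measures, a population of identical policies scores zero on both, and a policy that either behaves differently or earns a different return against the shared opponent raises the corresponding score.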

Keywords: adversarial environment; deep reinforcement learning; population diversity; Shapley value; behavior representation

Classification number: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]
