Authors: ZHUANG Shu-xin (庄述鑫); CHEN Yong-hong (陈永红); HAO Yi-hang (郝一行); WU Wei-wei (吴巍炜); XU Xue-yong (徐学永); WANG Wan-yuan (王万元) (School of Computer Science and Engineering, Southeast University, Nanjing 211189; Shenyang Aeroengine Design and Research Institute, Yangzhou Collaborative Innovation Research Institute Co., Ltd., Yangzhou 210016; Nanjing North Information Industrialization Group Co., Ltd., Nanjing 211189, China)
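Affiliations: [1] School of Computer Science and Engineering, Southeast University, Nanjing, Jiangsu 211189 [2] Yangzhou Collaborative Innovation Research Institute Co., Ltd., Shenyang Aircraft Design and Research Institute, Yangzhou, Jiangsu 210016 [3] North Information Control Research Academy Group Co., Ltd., Nanjing, Jiangsu 211189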
Source: Computer Engineering & Science (《计算机工程与科学》), 2024, No. 6, pp. 1081-1091 (11 pages)
Abstract: In adversarial game environments, the target agent aims to generate a highly robust game policy that consistently achieves high returns against different opponent policies. Existing self-play-based policy generation methods tend to overfit to a single opponent policy, so the learned policy has low robustness and is vulnerable to attacks from other opponent policies. In addition, existing methods that combine deep reinforcement learning with game theory to iteratively generate opponent policies converge slowly in complex adversarial scenarios with large decision spaces. To address these problems, a robust policy generation method based on population diversity is proposed, in which each of the two adversaries maintains its own policy population pool and the policies within each population are kept diverse in order to generate a robust target policy. To ensure population diversity, policy diversity is measured from two perspectives, behavior and quality: behavioral diversity refers to the differences between the state-action trajectories of different policies, and quality diversity refers to the differences between the final returns that different policies obtain against the same opponent. Finally, the robustness of the policies generated with the proposed population-diversity-based method is validated in typical adversarial environments with continuous state and action spaces.
Keywords: adversarial environment; deep reinforcement learning; population diversity; Shapley value; behavior representation
Classification number: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]
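Illustrative note: the abstract defines two views of population diversity but gives no formulas. The sketch below is only one possible reading, not the paper's method. It assumes that each trajectory is a list of equal-dimension (state, action) vectors and that diversity is an average pairwise difference; the function names `trajectory_distance`, `behavioral_diversity`, and `quality_diversity` are hypothetical. The keywords mention Shapley value and behavior representation, which this sketch does not attempt to reproduce.

```python
from itertools import combinations
from math import dist
from statistics import mean


def trajectory_distance(traj_a, traj_b):
    """Mean Euclidean distance between time-aligned (state, action) vectors.

    Assumption: each trajectory is a list of tuples, one per time step,
    holding the state and action values concatenated into one vector.
    Only the overlapping prefix of the two trajectories is compared.
    """
    steps = min(len(traj_a), len(traj_b))
    if steps == 0:
        raise ValueError("trajectories must contain at least one step")
    return mean(dist(a, b) for a, b in zip(traj_a[:steps], traj_b[:steps]))


def behavioral_diversity(trajectories):
    """Average pairwise trajectory distance over a policy population."""
    pairs = list(combinations(trajectories, 2))
    if not pairs:
        return 0.0  # a single policy has no diversity
    return mean(trajectory_distance(a, b) for a, b in pairs)


def quality_diversity(returns):
    """Average pairwise absolute gap between returns.

    Assumption: returns[i] is the final return of policy i when it
    plays against the same fixed opponent.
    """
    pairs = list(combinations(returns, 2))
    if not pairs:
        return 0.0
    return mean(abs(a - b) for a, b in pairs)


if __name__ == "__main__":
    # Two toy policies, each observed for two steps in a 2-D vector space.
    population = [
        [(0.0, 0.0), (0.0, 0.0)],
        [(3.0, 4.0), (3.0, 4.0)],
    ]
    print(behavioral_diversity(population))
    print(quality_diversity([1.0, 3.0, 5.0]))
```

Under this reading, a population whose members follow similar trajectories, or earn similar returns against the same opponent, scores close to zero on the corresponding measure.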