Affiliation: [1] State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023
Source: Scientia Sinica (Technologica), 2023, No. 4, pp. 547-564 (18 pages)
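Funding: Supported by the 2018 Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (No. 2018AAA0102302) and the Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University.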
Abstract: Collective intelligence is a key feature of new-generation unmanned aerial vehicle (UAV) swarm systems, which are a typical class of collective intelligence activation-convergence systems. Multi-agent reinforcement learning (MARL) currently shows strong advantages and is an important approach to building new-generation autonomous intelligent UAV swarm systems. However, the MARL training process is still a "black box", and there is no effective way to measure the degree to which collective intelligence is activated and converges. To address this problem, this paper starts from the agents' policies in MARL and uses policy diversity to measure the degree of activation-convergence of a UAV swarm during MARL training. Drawing on concepts from species diversity and information theory, the paper establishes that policy diversity has two aspects, richness and evenness, and proposes two methods for computing it: "Quadratic Entropy of Policy Distance" and "Information Entropy of Action Distribution". A UAV swarm air-penetration scenario is designed to test the validity and utility of the proposed policy diversity metric and both computation methods, and the two methods are compared through sensitivity analysis. The experimental results show that both methods effectively distinguish changes in policy diversity in this scenario and are consistent with each other, which confirms the validity of the metric and its computation methods. Regarding utility, the experiments confirm the relationship between policy diversity and reward, as well as the mutual influence and correlation between dynamic changes in the environment and policy diversity, demonstrating the potential utility of policy diversity for understanding collective intelligence systems and guiding the activation-convergence process. The proposed policy diversity metric and its computation methods provide methodological support for quantitatively assessing the degree of activation-convergence in collective intelligence systems and, in turn, for guiding and intervening in their learning and training.
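The abstract names the two measures but gives no formulas, so the following is only a minimal sketch of how measures of this kind are commonly built, not the authors' method. It assumes Rao-style quadratic entropy, Q = Σᵢ Σⱼ wᵢ wⱼ d(i, j), where each agent's policy is represented as action distributions over a shared set of probe states, d is the mean total variation distance between two agents' distributions, and agent weights are uniform; the second measure is assumed to be the Shannon entropy of the pooled empirical distribution of discrete actions. The probe-state representation, the distance, the weights, and all function names (`policy_distance`, `quadratic_entropy`, `action_entropy`) are hypothetical choices made for illustration.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions (in [0, 1])."""
    return 0.5 * np.abs(p - q).sum()

def policy_distance(pi_a, pi_b):
    """Hypothetical policy distance: mean total variation distance between two
    agents' action distributions over the same probe states.
    pi_a, pi_b: arrays of shape (n_states, n_actions) whose rows sum to 1."""
    return float(np.mean([total_variation(pa, pb) for pa, pb in zip(pi_a, pi_b)]))

def quadratic_entropy(policies, weights=None):
    """Rao-style quadratic entropy Q = sum_i sum_j w_i w_j d(i, j).
    Larger pairwise distances (richness) and more even weights (evenness)
    both increase Q. Uniform agent weights are assumed by default."""
    n = len(policies)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = policy_distance(policies[i], policies[j])
    return float(w @ d @ w)

def action_entropy(actions, n_actions):
    """Shannon entropy (in nats) of the empirical distribution of discrete
    actions pooled across all agents."""
    counts = np.bincount(np.asarray(actions), minlength=n_actions).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]  # drop unused actions, since 0 * log 0 is taken as 0
    return float(np.sum(p * np.log(1.0 / p)))

# Two agents that always choose opposite actions in a single probe state.
a = np.array([[1.0, 0.0]])
b = np.array([[0.0, 1.0]])
print(quadratic_entropy([a, b]))          # 0.5
print(action_entropy([0, 0, 1, 1], 2))    # log(2), about 0.693
```

In this sketch, identical policies give Q = 0 and a swarm that uses only one action gives zero action entropy, so both quantities rise as the agents' behaviour becomes more varied and more evenly spread, which matches the richness and evenness aspects described in the abstract.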
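Keywords: collective intelligence activation-convergence measurement; policy diversity; quadratic entropy of policy distance; information entropy of action distribution; UAV swarm navigation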
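CLC numbers: V279 [Aeronautical and Astronautical Science and Technology: Aircraft Design]; V249 [Automation and Computer Technology: Control Theory and Control Engineering]; TP18 [Automation and Computer Technology: Control Science and Engineering]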