Authors: 盛蕾 SHENG Lei; 陈希亮 CHEN Xiliang; 赖俊 LAI Jun (College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China)
Affiliation: [1] College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China
Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), 2024, Issue 8, pp. 2169-2179 (11 pages)
Funding: National Natural Science Foundation of China (61806221).
Abstract: Offline pre-training of a base model with a decision Transformer can effectively address the low sample efficiency and poor scalability of online multi-agent reinforcement learning, but this generative pre-training approach performs poorly in multi-agent tasks where individual rewards are hard to define and the dataset does not cover the optimal policy. To address this problem, a multi-agent reinforcement learning algorithm combining offline pre-training with online fine-tuning is proposed, which improves the decision Transformer with a latent state distribution. The algorithm uses an autoencoder together with one-hot encoding to generate discrete latent state representations that retain important information from the original state space. The generatively pre-trained decision Transformer is improved through latent temporal abstraction, a technique similar to data augmentation, which to some extent mitigates the extrapolation error caused by offline datasets that do not fully cover the state space. Centralized training with decentralized execution is adopted to handle the credit assignment problem among agents during online fine-tuning, and a multi-agent policy gradient algorithm that encourages exploration further searches for cooperative policies in downstream tasks. Experiments on the StarCraft simulation platform show that, compared with baseline algorithms, the proposed method achieves higher scores and stronger generalization in tasks with little or even no offline trajectory data.
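The abstract's first component, an autoencoder combined with one-hot encoding that produces discrete latent state representations, can be illustrated with a minimal sketch. The paper's actual architecture and hyperparameters are not given on this page; the module below is a hypothetical straight-through one-hot autoencoder in PyTorch, with all names, sizes, and the training loop invented for illustration.

```python
# Hypothetical sketch, assuming a straight-through one-hot bottleneck;
# not the paper's actual architecture. All names/sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneHotAutoencoder(nn.Module):
    def __init__(self, obs_dim: int, num_codes: int = 64, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_codes),   # logits over discrete latent codes
        )
        self.decoder = nn.Sequential(
            nn.Linear(num_codes, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),     # reconstruct the raw state
        )

    def forward(self, obs: torch.Tensor):
        logits = self.encoder(obs)
        # Straight-through one-hot: hard one-hot code on the forward pass,
        # soft (softmax) gradients on the backward pass.
        soft = F.softmax(logits, dim=-1)
        hard = F.one_hot(soft.argmax(dim=-1), soft.shape[-1]).float()
        code = hard + soft - soft.detach()
        recon = self.decoder(code)
        return code, recon

# One training step: the reconstruction loss keeps the discrete code
# informative about the original state space, as the abstract describes.
model = OneHotAutoencoder(obs_dim=32)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
obs = torch.randn(256, 32)                 # dummy batch of agent observations
code, recon = model(obs)
loss = F.mse_loss(recon, obs)
opt.zero_grad(); loss.backward(); opt.step()
```

In such a setup, the discrete codes could serve as the temporally abstract tokens consumed by the pre-trained decision Transformer, though how the paper wires the two together is not specified here.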
Keywords: offline multi-agent reinforcement learning; distributed learning; representation learning; large language model
Classification Code: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]