分布式在线鞍点问题的Bandit反馈优化算法

Bandit Feedback Optimization Algorithm for Distributed Online Saddle Point Problem

作　　者：张文韬张保勇[1] 袁德明[1] 徐胜元[1] ZHANG Wen-Tao;ZHANG Bao-Yong;YUAN De-Ming;XU Sheng-Yuan(School of Automation,Nanjing University of Science and Technology,Nanjing 210094)

机构地区：[1]南京理工大学自动化学院,南京210094

出　　处：《自动化学报》2025年第4期857-874,共18页Acta Automatica Sinica

基　　金：国家自然科学基金(62273181,62373190,62221004)资助。

摘　　要：本文研究了多智能体时变网络上基于Bandit反馈的分布式在线鞍点问题,其中每个智能体通过本地计算和局部信息交流去协作最小化全局损失函数.在Bandit反馈下,包括梯度在内的损失函数信息是不可用的,每个智能体仅能获得和使用在某决策或其附近产生的函数值.为此,结合单点梯度估计方法和预测映射技术,提出一种非欧几里得意义上的分布式在线Bandit鞍点优化算法.以动态鞍点遗憾作为性能指标,对于一般的凸−凹损失函数,建立了遗憾上界并在某些预设条件下确保所提算法的次线性收敛.此外,考虑到在迭代优化中计算优化子程序的精确解通常较为困难,进一步扩展一种基于近似计算方法的算法变种,并严格分析精确度设置对扩展算法遗憾上界的影响.最后,通过一个目标跟踪案例对算法的有效性和先进性进行仿真验证.This paper studies the distributed online saddle point problem with Bandit feedback over a multi-agent time-varying network,in which each agent collaborates to minimize the global loss function through local calculation and local information exchange.Under Bandit feedback,loss function information including gradients is not available,and each agent can only obtain and use the function value generated by a decision or decisions near it.To this end,a distributed online Bandit saddle point optimization algorithm in a non-Euclidean sense is proposed by combining the one-point gradient estimation method and the predictive mapping technique.Taking the expected dynamic saddle point regret as the performance metric,we establish the related regret upper bound for the general convex-concave loss functions and ensure that the proposed algorithm converges sublinearly under certain preconditions.In addition,considering that computing the exact solutions of the optimization oracles is usually difficult in iterative optimization,this paper further expands an algorithm variant based on an approximate computation method,and rigorously analyzes the impact of precision settings on the regret upper bound of the expanded algorithm.Finally,the effectiveness and advancement of the proposed algorithms are verified through a simulation example of target tracking.

关键词：BANDIT 反馈分布式优化在线鞍点问题镜面下降动态鞍点遗憾

分类号：TP18[自动化与计算机技术—控制理论与控制工程]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

分布式在线鞍点问题的Bandit反馈优化算法

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

分布式在线鞍点问题的Bandit反馈优化算法

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索