Context-aware attention fused Transformer tracking


Authors: Xu Han; Dong Shihao; Zhang Jiawei; Zheng Yuhui [1] (School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China)

Affiliation: [1] School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China

Source: Journal of Image and Graphics (《中国图象图形学报》), 2025, No. 1, pp. 212-224 (13 pages)

Fund: National Natural Science Foundation of China (U20B2065); Natural Science Foundation of Jiangsu Province (BK20211539).

Abstract:

Objective: Visual object tracking, one of the key tasks in computer vision, aims to predict the size and position of a target in a given video sequence. In recent years, tracking has been widely applied in autonomous driving, unmanned aerial vehicles (UAVs), military activities, and intelligent surveillance. Although numerous excellent methods have emerged, multifaceted challenges remain, including but not limited to shape variation, occlusion, motion blur, and interference from nearby objects. Current tracking methods fall into two main groups: correlation-filter based and deep-learning based. The former approximates the tracking process as a search computation in the image signal domain, but its hand-crafted features make it difficult to fully exploit image representation information, which greatly limits performance. Deep learning, with its powerful visual representation capabilities, has brought significant progress to the field. In particular, Transformer trackers have achieved breakthroughs, in which the self-attention mechanism plays an important role. However, the independent correlation computation in self-attention tends to produce indistinct weights, limiting tracking performance. To address this, a Transformer tracking method fusing context-aware attention is proposed.

Method: First, Swin Transformer (hierarchical vision Transformer using shifted windows) is introduced to extract visual features, and a cross-scale strategy integrates deep and shallow feature information to strengthen the network's representation of targets in complex scenes; the cross-scale fusion captures key information at different scales and the diverse texture features of the template and search images, helping the tracking network better understand the target. Second, an encoder-decoder built on context-aware attention fully fuses template features and search features. The context-aware attention uses a nested attention computation and adds a weight-assigning target mask, which effectively suppresses the noise caused by inaccurate correlation computation. Finally, a corner prediction head estimates the target bounding box, and the template image is updated according to similarity scores.

Results: Extensive experiments on several public datasets, including TrackingNet (large-scale object tracking dataset), LaSOT (large-scale single object tracking), and GOT-10K (generic object tracking benchmark), show that the proposed method achieves excellent performance. On GOT-10K it reaches an average overlap of 73.9%, ranking first among all compared methods. On LaSOT its AUC (area under curve) score and precision are 0.687 and 0.749, which are 1.1% and 2.7% higher, respectively, than the second-best ToMP (transforming model prediction for tracking). On TrackingNet its AUC score and precision are 0.831 and 0.807, which are 0.8% and 0.3% above the second-best method.

Conclusion: The proposed method uses context-aware attention to focus on target information in the feature sequence, improves the precision of vector interaction, copes effectively with fast motion and interference from similar objects, and thus improves tracking performance.
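The context-aware attention described in the Method refines the standard correlation weights with a nested (second-level) attention computation and a weight-assigning target mask. The abstract does not give the exact formulation, so the sketch below is only illustrative: it assumes the nested step re-weights each query's attention row by its consistency with the other rows, and that the target mask is a per-key soft weight; all names (`context_aware_attention`, `target_mask`) are hypothetical, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_aware_attention(q, k, v, target_mask):
    """Illustrative sketch: scaled dot-product attention whose weight map
    is refined by a nested attention over the rows of the weight matrix
    and by a soft target mask that down-weights non-target keys."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # (Nq, Nk) first-level correlation
    weights = softmax(scores, axis=-1)
    # nested attention: similarity between the queries' attention distributions
    context = softmax(weights @ weights.T / np.sqrt(weights.shape[-1]), axis=-1)
    refined = context @ weights                # context-refined weight map
    refined = refined * target_mask[None, :]   # suppress non-target keys
    refined = refined / (refined.sum(axis=-1, keepdims=True) + 1e-9)
    return refined @ v                         # (Nq, d_v)
```

In a real tracker these tensors would be batched multi-head features of the template and search images; the single-head numpy form only shows where the mask and the nested step enter the computation.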
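The corner prediction head mentioned in the Method estimates the bounding box from corner cues. A common realization of such a head (e.g. in STARK-style trackers) predicts a top-left and a bottom-right heatmap and reads each out with a soft-argmax; the sketch below assumes that design and is not necessarily the paper's exact head (`soft_argmax` and `corner_head_box` are hypothetical names).

```python
import numpy as np

def soft_argmax(heatmap):
    """Expected (x, y) coordinate under the softmax-normalized heatmap."""
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())
    p = p / p.sum()
    ys, xs = np.mgrid[0:h, 0:w]        # row index = y, column index = x
    return (p * xs).sum(), (p * ys).sum()

def corner_head_box(tl_heatmap, br_heatmap):
    """Box (x1, y1, x2, y2) from top-left / bottom-right corner heatmaps."""
    x1, y1 = soft_argmax(tl_heatmap)
    x2, y2 = soft_argmax(br_heatmap)
    return x1, y1, x2, y2
```

The soft-argmax keeps the box estimate differentiable, which is why corner heads favor it over a hard argmax during training.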

Keywords: computer vision; object tracking; context-aware attention; Transformer; feature fusion

CLC number: TP391.4 [Automation and Computer Technology / Computer Application Technology]

 
