Masked Autoencoders as Single Object Tracking Learners  Cited by: 1


Authors: Chunjuan Bo, Xin Chen, Junxing Zhang

Affiliations: [1] School of Information and Communication Engineering, Dalian Minzu University, Dalian, 116600, China; [2] School of Information and Communication Engineering, Dalian University of Technology, Dalian, 116024, China

Source: Computers, Materials & Continua (English-language journal), 2024, No. 7, pp. 1105-1122 (18 pages)

Funding: Supported in part by the National Natural Science Foundation of China (No. 62176041) and in part by the Excellent Science and Technique Talent Foundation of Dalian (No. 2022RY21).

Abstract: Visual tracking has advanced significantly in recent years, mainly due to the formidable modeling capabilities of the Vision Transformer (ViT). However, the strong performance of such trackers relies heavily on ViT models pretrained for long periods, which limits more flexible model designs for tracking tasks. To address this issue, we propose an efficient unsupervised ViT pretraining method for the tracking task based on masked autoencoders, called TrackMAE. During pretraining, we employ two shared-parameter ViTs, serving as the appearance encoder and the motion encoder, respectively. The appearance encoder encodes randomly masked image data, while the motion encoder encodes randomly masked pairs of video frames. Subsequently, an appearance decoder and a motion decoder separately reconstruct the original image data and video frame data at the pixel level. In this way, the ViT learns to understand both the appearance of images and the motion between video frames simultaneously. Experimental results demonstrate that ViT-Base and ViT-Large models, pretrained with TrackMAE and combined with a simple tracking head, achieve state-of-the-art (SOTA) performance without additional design. Moreover, compared to the currently popular MAE pretraining methods, TrackMAE consumes only 1/5 of the training time, which will facilitate the customization of diverse models for tracking. For instance, we additionally customize a lightweight ViT-XS, which achieves SOTA efficient tracking performance.
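The random masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the patch size (16), the input resolution (224 x 224), the mask ratio (0.75), the choice to treat a frame pair as one concatenated token sequence, and the helper names `random_mask` and `num_patches` are all assumptions made for illustration only.

```python
import random


def num_patches(height, width, patch):
    """Number of non-overlapping patch tokens in one image."""
    return (height // patch) * (width // patch)


def random_mask(num_tokens, mask_ratio, rng):
    """Split token indices into (visible, masked) by uniform random sampling."""
    num_masked = int(num_tokens * mask_ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


rng = random.Random(0)
PATCH, RATIO = 16, 0.75  # assumed values, not taken from the paper

# Appearance branch: a single randomly masked image.
n_img = num_patches(224, 224, PATCH)
img_visible, img_masked = random_mask(n_img, RATIO, rng)

# Motion branch: a randomly masked pair of video frames,
# treated here (by assumption) as one token sequence of double length.
n_pair = 2 * num_patches(224, 224, PATCH)
pair_visible, pair_masked = random_mask(n_pair, RATIO, rng)

print(len(img_visible), len(img_masked))
print(len(pair_visible), len(pair_masked))
```

In a setup like this, only the visible tokens would be passed to the shared encoder, and each decoder would be asked to reconstruct the pixels of the masked tokens; how TrackMAE actually arranges these steps is specified in the paper itself, not here.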

Keywords: Visual object tracking; vision transformer; masked autoencoder; visual representation learning

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
