Spatiotemporal feature fusion network based multi-objects tracking and segmentation (时空特征融合网络的多目标跟踪与分割)  Cited by: 1

Authors: Liu Yuting; Zhang Kaihua[1]; Fan Jiaqing; Liu Qingshan[1] (Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science and Technology, Nanjing 210044, China)

Affiliations: [1] Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science and Technology, Nanjing 210044, China

Source: Journal of Image and Graphics (《中国图象图形学报》), 2022, No. 11, pp. 3257-3266 (10 pages)

Funding: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2018AAA0100400); National Natural Science Foundation of China (61876088, U20B2065, 61532009).

Abstract: Objective Multi-object tracking and segmentation (MOTS) is an important research direction in computer vision. Most existing methods borrow the detect-then-track paradigm from multi-object tracking and extend it with segmentation; they pay insufficient attention to salient feature information and struggle with problems such as object occlusion. To address these issues, this paper proposes a spatiotemporal feature fusion network (STFNet) for multi-object tracking and segmentation, which selects salient features with a spatial tri-coordinate attention (STCA) module and a temporal-reduced self-attention (TRSA) module to achieve strong tracking and segmentation performance.

Method STFNet consists of a 2D encoder and a 3D decoder. First, multiple consecutive frames are fed into the 2D encoder to extract image features at different resolutions. The lowest-resolution features are fused through three 3D convolutional layers; the STCA module then extracts the key spatial features, and the TRSA module extracts temporal features that carry key-frame information. Both are merged with the original features. Next, the merged features and the next-higher-resolution features are fed together into a 1 × 1 × 1 3D convolutional layer, and features at different levels are aggregated repeatedly in this way, yielding multiply fused features that contain both key temporal information and salient spatial information. Finally, STFNet fits the features to a three-dimensional Gaussian distribution for each instance; each Gaussian assigns pixels across consecutive frames to one object or to the background, which produces the tracking and segmentation result for every target. Specifically, STCA is an attention-enhanced version of coordinate attention, which computes attention weights along the horizontal and vertical directions only.

Result The method is evaluated quantitatively on two datasets, YouTube-VIS (YouTube video instance segmentation) and KITTI MOTS (multi-object tracking and segmentation). On YouTube-VIS, the proposed method improves AP (average precision) by 0.2% over the second-best model, CompFeat. On KITTI MOTS, compared with the second-best model, STEm-Seg, the proposed method reduces the ID switch count by 9 on the car class; on the pedestrian class, it improves sMOTSA (soft multi-object tracking and segmentation accuracy), MOTSA (multi-object tracking and segmentation accuracy), and MOTSP (multi-object tracking and segmentation precision) by 0.7%, 0.6%, and 0.9%, respectively, and reduces the ID switch count by 1. Ablation studies on KITTI MOTS verify the effectiveness of the STCA and TRSA modules and show that the proposed algorithm improves multi-object tracking and segmentation.

Conclusion The proposed multi-object tracking and segmentation model fully exploits the feature information shared across multiple frames, improving the performance of multi-object tracking and segmentation.
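
The abstract describes STCA as an attention-enhanced variant of coordinate attention, which gates features using horizontal and vertical attention weights only. The sketch below illustrates one plausible tri-coordinate reading of that idea in PyTorch, assuming the three coordinates are the temporal, height, and width axes of a (B, C, T, H, W) feature map; the class name, reduction ratio, and pooling choices are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TriCoordinateAttention(nn.Module):
    """Hypothetical STCA-like block: one attention gate per coordinate axis."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        hidden = max(channels // reduction, 8)
        # Shared bottleneck applied to the three pooled coordinate descriptors.
        self.bottleneck = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
        )
        # One gate head per coordinate axis (T, H, W).
        self.gate_t = nn.Conv1d(hidden, channels, kernel_size=1)
        self.gate_h = nn.Conv1d(hidden, channels, kernel_size=1)
        self.gate_w = nn.Conv1d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        # Average over the two axes not being described, keeping one coordinate.
        desc_t = x.mean(dim=(3, 4))  # (B, C, T)
        desc_h = x.mean(dim=(2, 4))  # (B, C, H)
        desc_w = x.mean(dim=(2, 3))  # (B, C, W)
        # Concatenate along the coordinate axis so the bottleneck is shared.
        joint = self.bottleneck(torch.cat([desc_t, desc_h, desc_w], dim=2))
        f_t, f_h, f_w = torch.split(joint, [t, h, w], dim=2)
        # Per-axis sigmoid gates, broadcast back onto the 5-D feature map.
        a_t = torch.sigmoid(self.gate_t(f_t)).view(b, c, t, 1, 1)
        a_h = torch.sigmoid(self.gate_h(f_h)).view(b, c, 1, h, 1)
        a_w = torch.sigmoid(self.gate_w(f_w)).view(b, c, 1, 1, w)
        return x * a_t * a_h * a_w

if __name__ == "__main__":
    feats = torch.randn(2, 64, 4, 32, 32)  # (B, C, T, H, W)
    print(TriCoordinateAttention(64)(feats).shape)  # torch.Size([2, 64, 4, 32, 32])
```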
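
The abstract states only that TRSA produces temporal features that carry key-frame information. The sketch below assumes one common cost-reduction scheme: every spatio-temporal token attends over T spatially pooled frame descriptors instead of over all T×H×W tokens, so attention cost scales with the number of frames. The class name and all shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class TemporalReducedSelfAttention(nn.Module):
    """Hypothetical TRSA-like block: pixels attend over pooled frame summaries."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        # Queries: one token per spatio-temporal position, (B, T*H*W, C).
        q = x.flatten(2).transpose(1, 2)
        # Keys/values: spatially pooled per-frame descriptors, (B, T, C).
        # This is the "reduction": attention cost grows with T, not T*H*W.
        kv = self.norm(x.mean(dim=(3, 4)).transpose(1, 2))
        out, _ = self.attn(self.norm(q), kv, kv)
        # Residual connection, then restore the (B, C, T, H, W) layout.
        return (q + out).transpose(1, 2).reshape(b, c, t, h, w)

if __name__ == "__main__":
    feats = torch.randn(2, 64, 4, 16, 16)
    print(TemporalReducedSelfAttention(64)(feats).shape)  # torch.Size([2, 64, 4, 16, 16])
```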
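
The decoder repeatedly merges the fused low-resolution features with the next-higher-resolution encoder features through a 1 × 1 × 1 3D convolution. A minimal sketch of one such fusion step follows, assuming per-frame 2D encoder features stacked along a temporal axis, trilinear upsampling, and channel concatenation; the channel counts and class name are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseLevel(nn.Module):
    """Hypothetical one-level fusion step of an encoder/decoder pyramid."""

    def __init__(self, high_ch: int, low_ch: int, out_ch: int):
        super().__init__()
        # 1x1x1 3-D convolution merging the two feature levels.
        self.proj = nn.Conv3d(high_ch + low_ch, out_ch, kernel_size=1)

    def forward(self, high_res: torch.Tensor, low_res: torch.Tensor) -> torch.Tensor:
        # high_res: (B, C_h, T, H, W) stacked per-frame encoder features.
        # low_res:  (B, C_l, T, H/2, W/2) fused features from the level below.
        low_up = F.interpolate(low_res, size=high_res.shape[2:],
                               mode="trilinear", align_corners=False)
        return self.proj(torch.cat([high_res, low_up], dim=1))

if __name__ == "__main__":
    high = torch.randn(1, 96, 4, 64, 64)
    low = torch.randn(1, 192, 4, 32, 32)
    print(FuseLevel(96, 192, 128)(high, low).shape)  # torch.Size([1, 128, 4, 64, 64])
```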
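
Finally, the abstract says STFNet fits a three-dimensional Gaussian distribution per instance and lets each Gaussian claim pixels across consecutive frames (similar in spirit to the STEm-Seg baseline it compares against). The sketch below shows the pixel-to-instance assignment step under that reading; the embedding dimensionality, the 0.5 threshold, and the function name are assumptions for illustration.

```python
import torch

def assign_pixels(emb: torch.Tensor, mus: torch.Tensor,
                  variances: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Hypothetical assignment: emb is (T, H, W, D) per-pixel embeddings;
    mus/variances are (K, D) per-instance Gaussian parameters.
    Returns an instance-id map (T, H, W) with 0 = background, 1..K = instances."""
    diff = emb.unsqueeze(3) - mus  # (T, H, W, K, D)
    # Unnormalised Gaussian membership probability of each pixel per instance.
    prob = torch.exp(-0.5 * (diff ** 2 / variances).sum(dim=-1))  # (T, H, W, K)
    best_prob, best_k = prob.max(dim=-1)
    # A pixel joins its best-matching instance only if the match is confident.
    return torch.where(best_prob > threshold, best_k + 1, torch.zeros_like(best_k))

if __name__ == "__main__":
    emb = torch.randn(4, 32, 32, 3)  # 4 frames, 3-D embedding per pixel
    mus = torch.randn(2, 3)          # two instances
    variances = torch.ones(2, 3)
    print(assign_pixels(emb, mus, variances).shape)  # torch.Size([4, 32, 32])
```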

Keywords: deep learning; multi-object tracking and segmentation (MOTS); 3D convolutional neural network; feature fusion; attention mechanism

CLC Number: TP391.4 [Automation and Computer Technology / Computer Application Technology]

 
