Multiframe spatiotemporal attention-guided semisupervised video segmentation

Authors: Luo Sihan, Yuan Xia [1], Liang Yongshun [2]

Affiliations: [1] School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; [2] School of Mathematics and Statistics, Nanjing University of Science and Technology, Nanjing 210094, China

Source: Journal of Image and Graphics (中国图象图形学报), 2024, No. 5, pp. 1233-1251 (19 pages)

Funding: National Natural Science Foundation of China (12071218).

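Abstract: Objective: Traditional semisupervised video segmentation methods mostly rely on optical flow to model the feature association between key frames and the current frame. Optical flow, however, is prone to errors under occlusion, special textures, and similar conditions, which in turn causes problems in multiframe fusion. To fuse multiframe features more effectively, this paper extracts the appearance feature information of the first frame and the positional information of neighboring key frames, and fuses them through a Transformer and an improved PAN (path aggregation network) module, thereby learning and fusing multiframe features on the basis of multiframe spatiotemporal attention. Method: The multiframe spatiotemporal attention-guided semisupervised video segmentation method consists of two parts: video preprocessing (an appearance feature extraction network and a current-frame feature extraction network) and feature fusion based on the Transformer and the improved PAN module. Specifically, it comprises the following steps: construct an appearance feature extraction network to extract the appearance information of the first frame; construct a current-frame feature extraction network in which a Transformer module fuses the features of the current frame with those of the first frame, so that the appearance information of the first frame guides the extraction of current-frame features; perform local feature matching between the masks of several neighboring frames and the current-frame feature map, and select the frames most strongly correlated with the positional information of the current frame as neighboring key frames, which guide the extraction of the current frame's positional information; use the improved PAN feature aggregation module to fuse deep semantic information with shallow semantic information. Result: On the DAVIS (densely annotated video segmentation)-2016 dataset, the proposed algorithm achieves J and F scores of 81.5% and 80.9%; on the DAVIS-2017 dataset, it achieves 78.4% and 77.9%. Both results are better than those of the compared methods. The algorithm runs at 22 frames/s, ranking second in the comparison experiments and 1.6% lower than the PLM (pixel-level matching) algorithm. It also achieves competitive results on the YouTube-VOS (video object segmentation) dataset, where the mean of J and F reaches 71.2%, ahead of the compared methods. Conclusion: While segmenting the target object, the multiframe spatiotemporal attention-guided semisupervised video segmentation algorithm effectively fuses global and local information and reduces the loss of detail, improving the accuracy of semisupervised video segmentation while maintaining relatively high efficiency.

Objective: Video object segmentation (VOS) aims to provide high-quality segmentation of target object instances throughout an input video sequence, obtaining pixel-level masks of the target objects and thereby finely separating the target from the background. Compared with bounding-box-level tasks such as object tracking and detection, which select targets with rectangular frames, VOS has pixel-level accuracy, which is more conducive to locating the target accurately and outlining the details of the target's edge. Depending on the supervision information provided, VOS can be divided into three scenarios: semisupervised VOS, interactive VOS, and unsupervised VOS. In this study, we focus on the semisupervised task. In semisupervised VOS, pixel-level annotated masks of the first frame of the video are provided, and subsequent prediction frames can fully utilize the annotated mask of the first frame to assist in computing the segmentation result of each prediction frame. With the development of deep neural network technology, current semisupervised VOS methods are mostly based on deep learning.

These methods can be divided into the following three categories: detection-, matching-, and propagation-based methods. Detection-based object segmentation algorithms treat VOS as an image object segmentation task without considering the temporal association of videos, on the assumption that a strong frame-level object detector and segmenter suffices to segment the target frame by frame. Matching-based methods typically segment video objects by calculating pixel-level matching scores or semantic feature matching scores between the template frame and the current prediction frame. Propagation-based methods propagate the multiframe feature information preceding the prediction frame to the prediction frame and calculate the correlation between the prediction frame feature and the previous frame feature to represent video context information. This context information locates th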
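The Method part of the abstract states that a Transformer module fuses current-frame features with first-frame features. As a rough illustration only, the sketch below implements standard scaled dot-product cross-attention in plain Python, with current-frame positions as queries and first-frame positions as keys and values. The function names, the toy feature vectors, and the use of a single attention head without learned projections are all assumptions made for illustration; the abstract does not specify the paper's actual architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no learned weights).

    queries: current-frame feature vectors, one per position (assumed layout).
    keys, values: first-frame feature vectors, one pair per position.
    Returns one fused vector per query.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this current-frame position to every first-frame position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of first-frame values: appearance cues guide the current frame.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Toy example with two positions and 2-D features (hypothetical numbers).
current_frame = [[1.0, 0.0], [0.0, 1.0]]
first_frame_keys = [[1.0, 0.0], [0.0, 1.0]]
first_frame_values = [[10.0, 0.0], [0.0, 10.0]]
fused = cross_attention(current_frame, first_frame_keys, first_frame_values)
```

Each current-frame position draws most heavily on the first-frame position it resembles most, which is the general sense in which first-frame appearance can guide current-frame feature extraction.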

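Keywords: video object segmentation (VOS); feature extraction network; appearance feature information; spatiotemporal attention; feature aggregation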

Classification code: TP391.4 [Automation and Computer Technology: Computer Application Technology]
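The J score quoted in the abstract is, in the standard DAVIS evaluation protocol, the region similarity: the Jaccard index (intersection over union) between a predicted mask and the ground-truth mask, averaged over frames. The sketch below shows that computation on small binary masks in plain Python. The function names and the convention of returning 1.0 when both masks are empty are assumptions for illustration, and the boundary-based F score is not covered here.

```python
def jaccard_index(pred_mask, gt_mask):
    """Intersection over union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            p, g = bool(p), bool(g)
            intersection += p and g
            union += p or g
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return intersection / union


def mean_jaccard(pred_masks, gt_masks):
    """Average the per-frame Jaccard index over a sequence of frames."""
    scores = [jaccard_index(p, g) for p, g in zip(pred_masks, gt_masks)]
    return sum(scores) / len(scores)


# Toy 2x2 masks (hypothetical): one overlapping pixel out of two in the union.
predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [0, 0]]
score = jaccard_index(predicted, ground_truth)
```

A score of 81.5% therefore means that, on average, the overlap between predicted and annotated regions covers 81.5% of their union.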
