Visual words and self-attention mechanism fusion based video object segmentation method (Cited by: 2)

Authors: Ji Chuanjun (季传俊); Chen Yadang (陈亚当); Che Xun (车洵) (School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China; Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing 210044, China; Nanjing OpenX Technology Co., Ltd., Nanjing 210006, China)

Affiliations: [1] School of Computer Science, School of Software, and School of Cyberspace Security, Nanjing University of Information Science and Technology, Nanjing 210044, China [2] Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing 210044, China [3] Nanjing OpenX Technology Co., Ltd. (南京众智维信息科技有限公司), Nanjing 210006, China

Source: Journal of Image and Graphics (《中国图象图形学报》), 2022, No. 8, pp. 2444-2457 (14 pages)

Funding: National Natural Science Foundation of China (61802197).

Abstract: Objective: Video object segmentation (VOS) separates foreground objects from the background in a video sequence, with applications in video detection, video classification, video summarization, and self-driving. This work addresses the semi-supervised setting, in which the mask of the target object given in the initial frame is used to estimate its masks in the remaining frames of the video. Segmentation quality is often degraded by irregular object shapes, distracting background information, and fast motion. To address these problems, this paper proposes a video object segmentation algorithm that fuses visual words with a self-attention mechanism.

Method: For the reference frame, the image is fed into an encoder to extract pixel features at 1/8 of the original resolution. The features are then passed through an embedding space composed of several 3 × 3 convolution kernels, and the result is up-sampled to the original size. During training, pixels belonging to the same target are drawn close together in the embedding space, while pixels from different targets are pushed apart. Finally, using the target mask annotated in the reference frame, the pixels in the embedding space are grouped by a clustering algorithm into visual words that represent the target object. Because there is no ground truth for the object parts that visual words correspond to, learning them is challenging; a meta-training procedure therefore alternates between unsupervised learning of the visual words and supervised learning of pixel classification given those words. For the target frame, the image is fed through the encoder into the embedding space, and a word-matching operation represents each embedded pixel with the visual words generated from the reference frame, yielding one similarity map per word. A self-attention mechanism is then applied to the similarity maps to capture global dependencies, and the channel-wise maximum is taken as the prediction. To handle appearance changes of the target object and visual-word mismatch, an online update mechanism and a global correction mechanism are introduced to further improve accuracy.

Result: Experiments show that the proposed method achieves competitive results on the DAVIS (densely annotated video segmentation) 2016 and DAVIS 2017 datasets, with a J&F-mean (the mean of region similarity Jaccard and contour accuracy F-score) of 83.2% and 72.3%, respectively.

Conclusion: The proposed algorithm effectively handles interference caused by occlusion, deformation, and viewpoint changes, achieving high-quality video object segmentation.
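For readers who want to map the Method section onto code, the following is a minimal PyTorch sketch of the described pipeline (embedding, clustering into visual words, word matching, self-attention over the similarity maps, channel-wise maximum). It is an illustration under assumptions, not the authors' implementation: k-means is assumed as the clustering algorithm, cosine similarity as the word-matching measure, and a simplified non-local block as the self-attention; all function and parameter names (build_visual_words, word_matching, SimMapSelfAttention, num_words) are hypothetical.

```python
# Minimal sketch of the visual-word pipeline described in the abstract.
# Assumptions (not from the paper): k-means clustering, cosine-similarity
# word matching, and a simplified non-local self-attention block.
import torch
import torch.nn.functional as F


def build_visual_words(ref_embed, ref_mask, num_words=16, iters=10):
    """Cluster embedded reference-frame pixels of the target object
    into visual words with plain k-means.

    ref_embed: (C, H, W) per-pixel embeddings of the reference frame
    ref_mask:  (H, W) binary target mask from the initial frame
    returns:   (K, C) visual-word centroids, K <= num_words
    """
    pixels = ref_embed.permute(1, 2, 0)[ref_mask.bool()]  # (N, C) object pixels
    k = min(num_words, pixels.shape[0])
    words = pixels[torch.randperm(pixels.shape[0])[:k]].clone()  # init centroids
    for _ in range(iters):
        assign = torch.cdist(pixels, words).argmin(dim=1)  # nearest-word labels
        for j in range(k):
            members = pixels[assign == j]
            if len(members) > 0:
                words[j] = members.mean(dim=0)             # update centroid
    return words


def word_matching(tgt_embed, words):
    """Represent target-frame pixels by the reference visual words,
    producing one cosine-similarity map per word.

    tgt_embed: (C, H, W) target-frame embeddings; words: (K, C)
    returns:   (K, H, W) similarity maps
    """
    C, H, W = tgt_embed.shape
    feat = F.normalize(tgt_embed.reshape(C, -1), dim=0)    # (C, HW)
    return (F.normalize(words, dim=1) @ feat).reshape(-1, H, W)


class SimMapSelfAttention(torch.nn.Module):
    """Simplified non-local self-attention over the K similarity maps,
    capturing global dependencies across spatial positions."""

    def __init__(self, channels):
        super().__init__()
        self.q = torch.nn.Conv2d(channels, channels, kernel_size=1)
        self.k = torch.nn.Conv2d(channels, channels, kernel_size=1)
        self.v = torch.nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                  # x: (B, K, H, W)
        B, K, H, W = x.shape
        q = self.q(x).reshape(B, K, -1)                    # (B, K, HW)
        k = self.k(x).reshape(B, K, -1)
        v = self.v(x).reshape(B, K, -1)
        # (HW x HW) affinity between all spatial positions; quadratic in
        # spatial size, which is affordable at 1/8 resolution
        attn = torch.softmax(q.transpose(1, 2) @ k / K ** 0.5, dim=-1)
        out = (v @ attn.transpose(1, 2)).reshape(B, K, H, W)
        return out + x                                     # residual connection


# Usage: the channel-wise maximum over the attended similarity maps serves
# as the foreground prediction, as described in the Method section.
# sim = word_matching(tgt_embed, build_visual_words(ref_embed, ref_mask))
# score = SimMapSelfAttention(sim.shape[0])(sim.unsqueeze(0)).max(dim=1).values
```

The online update and global correction mechanisms described in the abstract operate on top of this pipeline and are not sketched here.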

Keywords: video object segmentation (VOS); clustering algorithm; visual words; self-attention mechanism; online update mechanism; global correction mechanism

Classification code: TP391.4 [Automation and Computer Technology - Computer Application Technology]
