Dual Encoding Integrating Key Frame Extraction for Video-text Cross-modal Entity Resolution (cited by: 5)

Authors: ZENG Zhixian; CAO Jianjun; WENG Nianfeng; JIANG Guoquan; FAN Qiang [1,2]

Affiliations: [1] College of Computer Science and Technology, National University of Defense Technology, Changsha 410003, Hunan, China; [2] The 63rd Research Institute, National University of Defense Technology, Nanjing 210007, Jiangsu, China

Source: Acta Armamentarii (兵工学报), 2022, No. 5, pp. 1107-1116 (10 pages)

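Funding: National Natural Science Foundation of China (61371196); China Postdoctoral Science Foundation Special Funding Project (2015M582832); National Major Science and Technology Special Project (2015ZX01040-201).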

Abstract: Existing video-text cross-modal entity resolution methods all sample frames uniformly when processing video, which inevitably loses video information and increases the complexity of the problem. To address this, a dual encoding method integrating key frame extraction (DEIKFE) is proposed for video-text cross-modal entity resolution. With the aim of fully retaining the information in the video, a key frame extraction algorithm is designed to extract the key frames, yielding a key frame set that represents the video. For the key frame set and the text, multi-level encoding is used to extract the global, local, and temporal features of each modality, which are concatenated to form a multi-level encoding representation. This representation is mapped into a common embedding space, and the model parameters are optimized with a cross-modal triplet ranking loss based on hard negative samples, so that matched video-text pairs become more similar and unmatched pairs less similar. Experiments on the MSR-VTT and VATEX datasets show that, compared with existing methods, the overall performance measure R@sum improves by 9.22% and 2.86%, respectively, demonstrating the superiority of the proposed method.
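The hard-negative triplet ranking loss mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the common "max of hinges" formulation over a mini-batch, cosine similarity in the common embedding space, and a hypothetical margin of 0.2. The function name `hard_negative_triplet_loss` and its parameters are introduced here for illustration only.

```python
import numpy as np

def hard_negative_triplet_loss(video_emb, text_emb, margin=0.2):
    """Sketch of a cross-modal triplet ranking loss with hard negatives.

    video_emb, text_emb: arrays of shape (n, d); row i of each is a matched pair.
    margin: hypothetical value chosen for illustration, not taken from the paper.
    """
    # L2-normalise rows so that the dot product equals cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    sim = v @ t.T            # sim[i, j] = similarity(video i, text j)
    pos = np.diag(sim)       # similarities of the matched pairs

    # Exclude the matched pairs, then pick the most similar mismatch.
    neg = sim.copy()
    np.fill_diagonal(neg, -np.inf)
    hardest_text = neg.max(axis=1)    # hardest negative text for each video
    hardest_video = neg.max(axis=0)   # hardest negative video for each text

    # Hinge: penalise a negative that comes within `margin` of the positive.
    loss_v2t = np.maximum(0.0, margin + hardest_text - pos)
    loss_t2v = np.maximum(0.0, margin + hardest_video - pos)
    return float(np.mean(loss_v2t + loss_t2v))
```

Minimizing this quantity pushes each matched pair's similarity above that of its hardest mismatch by at least the margin, which is the behaviour the abstract describes: matched video-text pairs become more similar and unmatched pairs less similar.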

Keywords: cross-modal entity resolution; key frame extraction; common embedding space; dual encoding; hard negative sample

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
