Fine-grained Semantic Association Video-Text Cross-modal Entity Resolution Based on Attention Mechanism


Authors: ZENG Zhi-xian; CAO Jian-jun; WENG Nian-feng; JIANG Guo-quan; XU Bin (Sixty-third Research Institute, National University of Defense Technology, Nanjing 210007, China)

Affiliation: [1] Sixty-third Research Institute, National University of Defense Technology, Nanjing 210007, China

Source: Computer Science (《计算机科学》), 2022, Issue 7, pp. 106-112 (7 pages)

Funding: National Natural Science Foundation of China (61371196); China Postdoctoral Science Foundation (2015M582832).

Abstract: With the rapid development of mobile networks and we-media platforms, large volumes of video and text data are continuously generated, creating an urgent practical need for video-text cross-modal entity resolution. To improve its performance, a fine-grained semantic association video-text cross-modal entity resolution model based on an attention mechanism (FSAAM) is proposed. For each video frame, feature information is extracted by an image feature extraction network as the frame representation, fine-tuned by a fully connected network, and mapped into a common space; in parallel, the words of the text description are vectorized by word embedding and mapped into the same common space by a bidirectional recurrent neural network. On this basis, an adaptive fine-grained video-text semantic association method is proposed: it computes the similarity between each word of the text and each video frame, applies an attention mechanism to form a weighted sum that yields the semantic similarity between a frame and the text, and filters out frames whose semantic similarity to the text is low, improving model performance. FSAAM mainly addresses two difficulties: words in a description are associated with video frames to different degrees, which makes video-text cross-modal semantic associations hard to construct, and videos contain many redundant frames. Experiments on the MSR-VTT and VATEX datasets demonstrate the superiority of the proposed method.
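To make the pipeline in the abstract concrete, the following is a minimal PyTorch sketch of the two common-space encoders and the word-frame attention similarity it describes. The feature dimensions, the GRU text encoder, the softmax temperature tau, and the mean-based rule for filtering weakly associated frames are illustrative assumptions, not details confirmed by the paper.

```python
# Sketch of FSAAM-style fine-grained video-text similarity (assumed details noted below).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Maps pre-extracted per-frame CNN features into the common space via a fine-tuning FC layer."""
    def __init__(self, feat_dim=2048, common_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, common_dim)

    def forward(self, frame_feats):                 # (B, T, feat_dim)
        return F.normalize(self.fc(frame_feats), dim=-1)

class TextEncoder(nn.Module):
    """Embeds words and maps them into the common space with a bi-directional RNN (GRU assumed)."""
    def __init__(self, vocab_size, emb_dim=300, common_dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, common_dim // 2,
                          batch_first=True, bidirectional=True)

    def forward(self, tokens):                      # (B, L)
        out, _ = self.rnn(self.emb(tokens))         # (B, L, common_dim)
        return F.normalize(out, dim=-1)

def video_text_similarity(frames, words, tau=0.2):
    """frames: (B, T, D), words: (B, L, D), both L2-normalized. Returns (B,) scores."""
    # Cosine similarity between every word and every frame: (B, T, L)
    sim = torch.bmm(frames, words.transpose(1, 2))
    # Attention over words per frame, then weighted sum -> frame-text semantic similarity
    attn = F.softmax(sim / tau, dim=-1)             # (B, T, L)
    frame_scores = (attn * sim).sum(dim=-1)         # (B, T)
    # Filter frames weakly associated with the text (mean threshold is an assumption)
    keep = frame_scores >= frame_scores.mean(dim=1, keepdim=True)
    return (frame_scores * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1)
```

In practice, frame_feats would come from a pretrained CNN applied per frame, and training would combine this similarity with a ranking loss over matched and mismatched video-text pairs; both are typical choices for this family of models rather than confirmed details of FSAAM.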

Keywords: cross-modal entity resolution; common space; attention mechanism; fine-grained; semantic similarity; feature extraction

Classification: TP311 [Automation and Computer Technology / Computer Software and Theory]

 
