Fine-grained Semantic Association Video-Text Cross-modal Entity Resolution Based on Attention Mechanism


Authors: ZENG Zhi-xian; CAO Jian-jun; WENG Nian-feng; JIANG Guo-quan; XU Bin (Sixty-third Research Institute, National University of Defense Technology, Nanjing 210007, China)

Affiliation: [1] Sixty-third Research Institute, National University of Defense Technology, Nanjing 210007, China

Source: Computer Science (《计算机科学》), 2022, Issue 7, pp. 106-112 (7 pages)

Funding: National Natural Science Foundation of China (61371196); China Postdoctoral Science Foundation (2015M582832).

Abstract: With the rapid development of mobile networks and we-media platforms, large volumes of video and text data are continuously generated, creating an urgent practical need for video-text cross-modal entity resolution. To improve its performance, a fine-grained semantic association video-text cross-modal entity resolution model based on an attention mechanism (FSAAM) is proposed. For each video frame, feature information is extracted by an image feature extraction network as the frame representation, fine-tuned by a fully connected network, and mapped into a common space; in parallel, the words of the text description are vectorized by word embedding and mapped into the same common space by a bidirectional recurrent neural network. On this basis, an adaptive fine-grained video-text semantic association method is proposed: it computes the similarity between each word of the text and each video frame, applies an attention mechanism to form a weighted sum that yields the semantic similarity between a frame and the text, and filters out frames whose semantic similarity to the text is low, improving model performance. FSAAM mainly addresses two difficulties: words in a description are associated with video frames to different degrees, which makes video-text cross-modal semantic associations hard to construct, and videos contain many redundant frames. Experiments on the MSR-VTT and VATEX datasets demonstrate the superiority of the proposed method.
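To make the pipeline in the abstract concrete, the following is a minimal PyTorch sketch of the two common-space encoders and the word-frame attention similarity it describes. The feature dimensions, the GRU text encoder, the softmax temperature tau, and the mean-based rule for filtering weakly associated frames are illustrative assumptions, not details confirmed by the paper.

```python
# Sketch of FSAAM-style fine-grained video-text similarity (assumed details noted below).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Maps pre-extracted per-frame CNN features into the common space via a fine-tuning FC layer."""
    def __init__(self, feat_dim=2048, common_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, common_dim)

    def forward(self, frame_feats):                 # (B, T, feat_dim)
        return F.normalize(self.fc(frame_feats), dim=-1)

class TextEncoder(nn.Module):
    """Embeds words and maps them into the common space with a bi-directional RNN (GRU assumed)."""
    def __init__(self, vocab_size, emb_dim=300, common_dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, common_dim // 2,
                          batch_first=True, bidirectional=True)

    def forward(self, tokens):                      # (B, L)
        out, _ = self.rnn(self.emb(tokens))         # (B, L, common_dim)
        return F.normalize(out, dim=-1)

def video_text_similarity(frames, words, tau=0.2):
    """frames: (B, T, D), words: (B, L, D), both L2-normalized. Returns (B,) scores."""
    # Cosine similarity between every word and every frame: (B, T, L)
    sim = torch.bmm(frames, words.transpose(1, 2))
    # Attention over words per frame, then weighted sum -> frame-text semantic similarity
    attn = F.softmax(sim / tau, dim=-1)             # (B, T, L)
    frame_scores = (attn * sim).sum(dim=-1)         # (B, T)
    # Filter frames weakly associated with the text (mean threshold is an assumption)
    keep = frame_scores >= frame_scores.mean(dim=1, keepdim=True)
    return (frame_scores * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1)
```

In practice, frame_feats would come from a pretrained CNN applied per frame, and training would combine this similarity with a ranking loss over matched and mismatched video-text pairs; both are typical choices for this family of models rather than confirmed details of FSAAM.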

Keywords: cross-modal entity resolution; common space; attention mechanism; fine-grained; semantic similarity; feature extraction

Classification: TP311 [Automation and Computer Technology / Computer Software and Theory]

 
