Video captioning based on semantic guidance

Authors: SHI Jia-hao (石佳豪); YAO Li (姚莉) [1,2]

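Affiliations: [1] School of Computer Science and Engineering, Southeast University, Nanjing, Jiangsu 211189, China; [2] Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, Nanjing, Jiangsu 211189, China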

Source: Journal of Graphics (《图学学报》), 2023, No. 6, pp. 1191-1201 (11 pages)

Funding: Nanjing Major Science and Technology Special Project (202209003).

Abstract: Video captioning aims to automatically generate a sentence that summarizes the events in a given input video. It has applications in video retrieval, short-video title generation, assistance for visually impaired people, and security monitoring. Existing methods overlook the role of semantic information in description generation, so the resulting models describe key information poorly. To address this problem, a video captioning model based on semantic guidance was designed. The model as a whole adopts the encoder-decoder framework. In the encoding stage, a semantic enhancement module first generates key entities and predicates, and a semantic fusion module then produces the overall semantic representation. In the decoding stage, a word selection module selects appropriate word vectors to guide description generation, so that semantic information is used efficiently and the output attends more closely to the key semantics. Experiments show that the model achieves CIDEr scores of 107.0% and 52.4% on two widely used datasets, MSVD and MSR-VTT, respectively, outperforming state-of-the-art models. User studies and visualization results also indicate that the descriptions generated by the model agree with human understanding.
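The abstract does not give the internals of the word selection module, so the following is only an illustrative sketch of one plausible reading: score a set of candidate semantic word vectors against the current decoder state, keep the top few, and blend them into a single guidance vector. The function name `select_words`, the dot-product scoring, the softmax weighting, and the toy vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math


def select_words(hidden, candidates, k=2):
    """Pick the k candidate words whose vectors best match the decoder state.

    hidden: list of floats, a hypothetical decoder state.
    candidates: dict mapping a semantic word to its (hypothetical) embedding.
    Returns (chosen_words, guidance_vector).
    """
    # Dot-product relevance of each candidate word to the decoder state.
    scores = {
        word: sum(h * x for h, x in zip(hidden, vec))
        for word, vec in candidates.items()
    }
    # Keep the k highest-scoring words, best first.
    chosen = sorted(scores, key=scores.get, reverse=True)[:k]
    # Softmax over the kept scores, shifted by the max for numerical stability.
    top = max(scores[w] for w in chosen)
    exps = [math.exp(scores[w] - top) for w in chosen]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Guidance vector: weighted sum of the chosen word vectors.
    guidance = [
        sum(wt * candidates[w][i] for wt, w in zip(weights, chosen))
        for i in range(len(hidden))
    ]
    return chosen, guidance


# Toy data (assumed): a 2-dimensional decoder state and three candidate words.
hidden = [1.0, 0.0]
candidates = {"dog": [0.9, 0.1], "run": [0.8, 0.3], "table": [-0.5, 0.7]}
words, guidance = select_words(hidden, candidates, k=2)
print(words, guidance)
```

In this toy setting the irrelevant word "table" is filtered out before the guidance vector is formed, which mirrors the stated goal of steering the decoder toward key semantics; a real model would learn the scoring function rather than use a fixed dot product.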

Keywords: video captioning; semantic guidance; Transformer; feature fusion; semantic enhancement

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
