Authors: SHI Jia-hao; YAO Li [1,2] (School of Computer Science and Engineering, Southeast University, Nanjing, Jiangsu 211189, China; Key Laboratory of Computer Network and Information Integration (Southeast University), Nanjing, Jiangsu 211189, China)
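Affiliations: [1] School of Computer Science and Engineering, Southeast University, Nanjing, Jiangsu 211189, China; [2] Key Laboratory of Computer Network and Information Integration of the Ministry of Education (Southeast University), Nanjing, Jiangsu 211189, China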
Source: Journal of Graphics (《图学学报》), 2023, No. 6, pp. 1191-1201 (11 pages)
Funding: Nanjing Major Science and Technology Special Project (202209003).
Abstract: Video captioning aims to automatically generate a sentence that summarizes the events in a given input video. It has applications in video retrieval, short-video title generation, assistance for visually impaired people, and security monitoring. Existing methods overlook the role of semantic information in caption generation, which leaves models poor at describing key information. To address this problem, a video captioning model based on semantic guidance was designed. The model as a whole adopts an encoder-decoder framework. In the encoding stage, a semantic enhancement module first generates key entities and predicates, and a semantic fusion module then produces an overall semantic representation. In the decoding stage, a word selection module selects appropriate word vectors to guide caption generation, so that semantic information is used efficiently and the output pays more attention to key semantics. Experiments showed that the model achieved CIDEr scores of 107.0% and 52.4% on two widely used datasets, MSVD and MSR-VTT, respectively, outperforming state-of-the-art models. User studies and visualization results also indicated that the generated captions agree with human understanding.
Keywords: video captioning; semantic guidance; Transformer; feature fusion; semantic enhancement
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]