A video captioning method combining semantic enhancement and multiple attention mechanisms


Authors: REN Jianhong; ZENG Qingwei; LI Xiangjun [2]; GONG Zheng [2]; LIU Fang (Educational Media Construction Division, Jiangxi Education and Assessment Institute, Nanchang 330038, China; School of Software, Nanchang University, Nanchang 330046, China; Network Center, Nanchang University, Nanchang 330046, China)

Affiliations: [1] Educational Media Construction Division, Jiangxi Education and Assessment Institute, Nanchang 330038, Jiangxi, China; [2] School of Software, Nanchang University, Nanchang 330046, Jiangxi, China; [3] Network Center, Nanchang University, Nanchang 330046, Jiangxi, China

Source: Journal of Nanchang University (Natural Science), 2023, No. 6, pp. 548-555 (8 pages)

Funding: National Natural Science Foundation of China (62262039, 62262023); Jiangxi Provincial Science and Technology Innovation Platform Project (20181BCD40005); Nanchang University pilot demonstration project under the Jiangxi Provincial Fiscal Science and Technology "Lump-Sum Funding" Program (ZBG20230418014); Science and Technology Research Project of the Jiangxi Provincial Department of Education (GJJ2210701); Jiangxi Provincial Key Teaching Reform Project (JXJG-2020-1-2); Jiangxi Provincial Graduate Innovation Special Fund (YC2023-S012, YC2023-S015, YC2023-S099); Jiangxi Provincial College Students' Innovation and Entrepreneurship Training Program (202210403057, 202310403001X, S202310403010, S202310403037).

Abstract: With the explosive growth of video data, the video captioning task has attracted increasing attention from researchers. Enabling computers to understand video content as humans do, and to express it accurately in natural language, remains one of the open challenges in this field. To address the insufficient use of semantic information and the inaccurate descriptions produced by existing representative video captioning models, this paper proposes a video captioning method that combines semantic enhancement with multiple attention mechanisms, built on an encoder-decoder framework. First, a visual-text feature aggregation method provides high-level semantic guidance for the encoder. Then, a Faster-RCNN network extracts video object features, and a graph convolutional network captures the latent semantic information among these objects, yielding enhanced features. Finally, a multiple attention mechanism is introduced so that the model makes better use of its input information, strengthening its learning ability. Experimental results on the MSVD and MSR-VTT datasets show that, compared with the baseline models, the proposed method reasonably optimizes the input information of the video captioning model and effectively extracts latent video semantics, thereby addressing the video-text cross-modal problem and the grammatical structure of the generated sentences. The method effectively improves captioning accuracy and the ability to describe complex scenes, demonstrating its advancement.
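Note: The abstract only sketches the architecture, so the following PyTorch snippet is a minimal illustrative sketch rather than the authors' implementation; all module names, dimensions (2048-d region features, a 512-d decoder state, 8 attention heads), and the fully connected stand-in object graph are assumptions. It illustrates the second and third steps described above: a graph convolutional layer refining object features (such as Faster-RCNN region features) to capture latent object relations, followed by a multi-head attention step in which the decoder state attends over the enhanced features.

    # Illustrative sketch only; architecture details are assumptions, not the paper's code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GCNLayer(nn.Module):
        """One graph-convolution layer: H' = ReLU(norm(A) @ H @ W)."""
        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.proj = nn.Linear(in_dim, out_dim)

        def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            # feats: (B, N, in_dim) object features; adj: (B, N, N) relation graph.
            # Row-normalize the adjacency so each object averages over its neighbors.
            adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
            return F.relu(self.proj(torch.bmm(adj, feats)))

    class SemanticEnhancedAttention(nn.Module):
        """Refine object features with a GCN, then attend to them from the decoder state."""
        def __init__(self, obj_dim: int = 2048, hidden: int = 512, heads: int = 8):
            super().__init__()
            self.gcn = GCNLayer(obj_dim, hidden)
            self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

        def forward(self, obj_feats, adj, dec_state):
            # obj_feats: (B, N, obj_dim) region features (e.g., from Faster-RCNN)
            # adj:       (B, N, N) object-relation graph (assumed given here)
            # dec_state: (B, 1, hidden) current decoder hidden state, used as the query
            enhanced = self.gcn(obj_feats, adj)   # relation-aware "enhanced features"
            ctx, _ = self.attn(dec_state, enhanced, enhanced)
            return ctx                            # (B, 1, hidden) context for word prediction

    if __name__ == "__main__":
        B, N = 2, 10                              # 2 clips, 10 detected objects each
        module = SemanticEnhancedAttention()
        obj = torch.randn(B, N, 2048)
        adj = torch.ones(B, N, N)                 # fully connected graph as a stand-in
        query = torch.randn(B, 1, 512)
        print(module(obj, adj, query).shape)      # torch.Size([2, 1, 512])

In a full decoder, this context vector would be fed, together with the previous word embedding, into an LSTM or Transformer step that predicts the next caption word.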

Keywords: video captioning; high-level semantics; graph neural network; attention mechanism; feature enhancement

CLC number: TP391 (Automation and Computer Technology: Computer Application Technology)

 
