Dense Video Captioning Based on a Multi-Modal Attention Mechanism

Authors: YANG Hao (杨昊); HAN Cuiling (韩翠玲) [2]; WANG Yude (王玉德) [1]; GAO Zhangchi (高张弛)

Affiliations: [1] School of Cyber Science and Engineering, Qufu Normal University, Qufu 273165, PRC; [2] School of Information Engineering, Shandong Polytechnic College, Jining 272067, Shandong, PRC

Source: Journal of Qufu Normal University (Natural Science), 2023, No. 2, pp. 62-70 (9 pages)

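Funding: Shandong Province Graduate Supervisor Guidance Ability Improvement Program (SDYY18119); Shandong Province Graduate Teaching Case Library Construction Project (SDYAL21090).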

Abstract: Dense video captioning (DVC) suffers from three problems: video features are underused, event localization and segmentation are inaccurate, and the generated captions are semantically poor. To address them, this paper adopts a dense video captioning method based on a multi-modal attention mechanism. Visual, audio, and speech features are extracted from the video. In the encoder, the multi-modal attention mechanism computes the degree of correlation between the video frame features of different modalities. In the decoder, it computes the degree of correlation between the caption word-sequence features and the multi-modal video frame features output by the encoder. The encoder output is fed to the video localization and segmentation module and the decoder output to the semantic captioning module, yielding the video segments and a caption for each segment. The method was analyzed theoretically and verified experimentally on the ActivityNet Captions dataset, where it reaches an F1-score of 60.09 and a METEOR score of 8.78. These results indicate that the method effectively improves the accuracy of both video localization and segmentation and semantic captioning.
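
The abstract does not specify the form of the attention computation, so the following is only a minimal sketch of how the "degree of correlation" between frame features of two modalities could be computed, assuming standard scaled dot-product attention. The function names (`softmax`, `cross_modal_attention`) and the toy feature values are hypothetical and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention of one modality over another.

    queries: T_q feature vectors of length d (e.g. visual frames)
    keys:    T_k feature vectors of length d (e.g. audio frames)
    values:  T_k feature vectors of length d_v, aligned with keys
    Returns (attended, weights): T_q vectors of length d_v, and a
    T_q x T_k matrix of attention weights (each row sums to 1).
    """
    scale = math.sqrt(len(queries[0]))
    attended, weights = [], []
    for q in queries:
        # Correlation score between this query frame and every key frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        weights.append(w)
        attended.append(out)
    return attended, weights


# Toy, made-up features: 2 visual frames and 3 audio frames, dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = cross_modal_attention(visual, audio, audio)
```

In this sketch each row of `weights` shows how strongly one visual frame attends to each audio frame. The same function could in principle serve the decoder side by passing caption word features as `queries` and encoder outputs as `keys` and `values`, but that wiring is an assumption rather than something the abstract states.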

Keywords: dense video captioning; multi-modal video features; feature fusion; multi-modal attention mechanism

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
