Research on video summarization algorithm fusing multimodal features and time zone detection


Authors: Bai Chen; Fan Tao; Wang Wenjing; Wang Guozhong (School of Electronic & Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China)

Affiliation: [1] School of Electronic & Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China

Source: Application Research of Computers, 2023, No. 11, pp. 3276-3281, 3288 (7 pages)

Funding: National Key R&D Program of China, Key Special Project, 2019 funded project (2019YFB180270200).

Abstract: Traditional video summarization algorithms do not fully exploit the multimodal information in videos and struggle to ensure the temporal consistency of the summary segments. To address these issues, this paper proposes a video summarization algorithm (MTNet) that fuses multimodal features with temporal zone detection. First, it extracts visual and audio feature representations from the video using pre-trained GoogLeNet and VGGish models, and designs a dimension smoothing operation to align the two modal features, giving the model a comprehensive representation capability. Second, because a generated summary should be globally representative, it combines single-layer and double-layer self-attention mechanisms with residual structures to extract long-range temporal features from the visual and audio features respectively, obtaining a single vector representation of the model over the temporal range. Finally, it predicts the summary boundaries and importance of each temporal segment of the video using separated temporal zone detection and weight sharing, and selects key video segments through non-maximum suppression to generate the summary. Experimental results on two standard datasets, SumMe and TvSum, show that MTNet has stronger representation capability and robustness. Its F1-score improves on both datasets over the anchor-free video summarization algorithm DSNet-AF and the shot-importance-prediction-based video summarization algorithm VASNet.
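The final selection step mentioned in the abstract, non-maximum suppression over candidate temporal segments, can be illustrated with a minimal sketch. This is not the authors' implementation: the segment representation (start and end frame, end exclusive), the function names `temporal_iou` and `temporal_nms`, the IoU threshold of 0.5, and the sample scores are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

# Assumed representation: (start_frame, end_frame), end exclusive.
Segment = Tuple[int, int]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two 1-D temporal segments."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments: List[Segment], scores: List[float],
                 iou_threshold: float = 0.5) -> List[int]:
    """Greedy 1-D NMS: return indices of kept segments, highest score first.

    The threshold value is a hypothetical default, not taken from the paper.
    """
    # Visit candidates from the highest predicted importance to the lowest.
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a candidate only if it does not overlap too much with
        # any segment that has already been kept.
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold
               for j in keep):
            keep.append(i)
    return keep


# Illustrative candidates and importance scores (made-up values).
segments = [(0, 10), (2, 12), (20, 30), (21, 29)]
scores = [0.9, 0.8, 0.7, 0.95]
print(temporal_nms(segments, scores))
```

In this sketch the two heavily overlapping pairs each collapse to their higher-scoring member, so only non-redundant segments remain as candidates for the summary.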

Keywords: multimodal features; feature fusion; video summarization; temporal zone detection; attention mechanism

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
