Research on Unsupervised Video Summarization Algorithm Based on Multimodal Fusion


Authors: PAN Tao; CHEN Hu [1,2]; HUANG Ju; WU Chang-ke; DENG Biao; WU Zhi-hong [1,2]

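Affiliations: [1] School of Computer Science, Sichuan University, Chengdu 610065, Sichuan, China; [2] State Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, Sichuan, China; [3] Dongfang Electric Corporation, Chengdu 610036, Sichuan, China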

Source: Computer Technology and Development (《计算机技术与发展》), 2024, No. 11, pp. 29-35 (7 pages)

Funding: Key Program of the National Natural Science Foundation of China (U20A20162); Sichuan Science and Technology Program (2022JDJQ0045)

Abstract: Video summarization constructs concise yet comprehensive summaries by selecting the most informative parts of a video, which helps viewers grasp the content quickly and saves storage space. Existing methods make inadequate use of multimodal information and have weak feature representation. To address this, the paper proposes an unsupervised video summarization algorithm based on multimodal fusion and multi-scale temporal information. First, a two-stage feature fusion module built on video image, audio, and text features preserves the non-redundant information between modalities and strengthens the per-frame feature representation. Second, self-attention and a feature pyramid network model the global and local temporal dependencies of the fused features. Finally, keyframes are selected according to multi-scale contextual information to form a high-quality video summary. Experimental results show that, compared with other unsupervised video summarization algorithms, the proposed algorithm improves F-Score on the SumMe dataset by 0.5 and 1.4 percentage points under the canonical and augmented settings, respectively, and achieves the best F-Score on the TVSum dataset.
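
The F-Score reported in the abstract can be illustrated with a minimal sketch. It assumes the overlap-based definition commonly used for keyframe or key-shot summaries, where precision and recall are computed from the frames shared by a predicted summary and a reference summary. This record does not state the paper's exact evaluation protocol, so the function name `summary_f_score` and the binary frame-selection inputs are hypothetical choices for illustration only.

```python
def summary_f_score(predicted, reference):
    """F-score between two binary frame-selection vectors.

    Assumption: each vector holds one 0/1 flag per frame, where 1 means
    the frame is included in the summary. This is an illustrative
    definition, not necessarily the protocol used in the paper.
    """
    if len(predicted) != len(reference):
        raise ValueError("selections must cover the same number of frames")

    # Frames chosen by both the predicted and the reference summary.
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    n_pred = sum(1 for p in predicted if p)
    n_ref = sum(1 for r in reference if r)

    # No shared frames (or an empty summary) gives a score of zero.
    if overlap == 0 or n_pred == 0 or n_ref == 0:
        return 0.0

    precision = overlap / n_pred  # share of predicted frames that are correct
    recall = overlap / n_ref      # share of reference frames that are recovered
    return 2 * precision * recall / (precision + recall)


# Toy example: 8 frames, 3 selected in each summary, 2 in common.
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
reference = [1, 0, 0, 0, 1, 1, 0, 0]
score = summary_f_score(predicted, reference)
print(f"F-score: {score:.4f}")
```

Under this definition, an improvement of 0.5 percentage points corresponds to the score rising by 0.005 on the 0 to 1 scale.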

Keywords: unsupervised video summarization; multimodal fusion; self-attention network; feature pyramid network; feature encoding

Classification No.: TP391.4 [Automation and Computer Technology - Computer Application Technology]

 
