Research and Implementation of Key Frame Summarization Model for News Short Video


Authors: CUI Xiaodan [1]; LIU Dawei [1]; LIU Yifan [1]; ZHAO Zhibin [1]; REN Yougui [2]; YAN Yongming [3]

Affiliations: [1] School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China; [2] Service Center of Natural Resource Affairs of Liaoning Province, Shenyang 110001, China; [3] Shenyang Dixin Artificial Intelligence Industry Research Institute Co., Ltd., Shenyang 110136, China

Source: Computer Engineering (《计算机工程》), 2023, No. 8, pp. 182-189 (8 pages)

Abstract: According to the "sound and picture relationship" theory in communication studies, news short videos convey their content directly and effectively through audio, which makes them typical "voice-dominated" videos. Existing video summarization techniques ignore the influence of the sound-picture relationship on how video content is presented, which leads to unstable performance on summarization tasks for specific types of short video. Targeting the voice-dominated nature of news short videos, this paper proposes a Key Frame Summarization model for News Short Video (KFS4NSV) based on the semantic similarity of multimodal features. Unlike traditional fusion models, the model first extracts multimodal features and then constructs a common semantic space, training jointly on image-text pairs by minimizing a contrastive loss function to obtain a cross-modal measure of semantic similarity between the audio text summary and the video frames. During summary generation, the model focuses on image content that is consistent with the semantic information in the audio and uses that information to select relevant key frames, yielding a more accurate short video summary. The experimental dataset consists of 450 CCTV news short videos and 385 Bilibili self-media news short videos, and the F1 value is used to measure the performance of the different models. The experimental results show that the proposed model reaches F1 values of 62.8% and 51.2% on the two datasets, which are 2.1 and 0.8 percentage points higher, respectively, than those of the MSVA model, indicating better performance on the news short video key frame summarization task.
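The abstract names three ingredients: a contrastive loss over image-text pairs in a common semantic space, similarity-based key frame selection, and F1 evaluation. The sketch below illustrates each on toy vectors. It is not the authors' implementation: the symmetric InfoNCE-style form of the loss, the `temperature` value, the top-`k` selection rule, the set-based F1 definition, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form) over matched pairs.

    Pair i is (image_embs[i], text_embs[i]); every other combination
    in the batch is treated as a negative.
    """
    n = len(image_embs)
    sims = [[cosine(img, txt) / temperature for txt in text_embs]
            for img in image_embs]

    def row_loss(row, target):
        # Numerically stable -log softmax(row)[target].
        m = max(row)
        log_sum = m + math.log(sum(math.exp(s - m) for s in row))
        return log_sum - row[target]

    image_to_text = sum(row_loss(sims[i], i) for i in range(n))
    text_to_image = sum(
        row_loss([sims[j][i] for j in range(n)], i) for i in range(n)
    )
    return (image_to_text + text_to_image) / (2 * n)


def select_key_frames(frame_embs, text_emb, k):
    """Indices (ascending) of the k frames most similar to the text summary."""
    scores = [cosine(f, text_emb) for f in frame_embs]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def f1_score(predicted, reference):
    """F1 between predicted and reference key frame index sets (assumed form)."""
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy data: four 2-D "frame embeddings" and one "audio text summary" embedding.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
summary = [1.0, 0.0]
picked = select_key_frames(frames, summary, k=2)   # frames 0 and 1
score = f1_score(picked, [0, 2])                   # one of two frames matches
```

In this toy setup the loss is small when each image embedding is paired with its own text embedding and large when the pairs are swapped, which is the behavior that pulls matching frames and text together in the shared space.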

Keywords: sound and picture relationship; voice-dominated; multimodal features; semantic similarity; key frame summarization

CLC number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
