Authors: YANG Hao (杨昊); HAN Cuiling (韩翠玲) [2]; WANG Yude (王玉德) [1]; GAO Zhangchi (高张弛)
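Affiliations: [1] School of Cyber Science and Engineering, Qufu Normal University, Qufu 273165, PRC; [2] School of Information Engineering, Shandong Polytechnic College, Jining 272067, Shandong, PRC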
Source: Journal of Qufu Normal University (Natural Science) (《曲阜师范大学学报(自然科学版)》), 2023, No. 2, pp. 62-70 (9 pages)
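Funding: Shandong Province Graduate Supervisor Mentoring Ability Improvement Program (SDYY18119); Shandong Province Graduate Teaching Case Library Construction Project (SDYAL21090).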
Abstract: Dense video captioning (DVC) suffers from insufficient use of video features, inaccurate temporal localization and segmentation, and semantically impoverished captions. To address these problems, this paper adopts a DVC method based on a multi-modal attention mechanism that extracts visual, audio, and speech features from the video. In the encoder, multi-modal attention computes the degree of correlation between the video frame features of different modalities. In the decoder, it computes the degree of correlation between the caption word-sequence features and the multi-modal video frame features output by the encoder. The encoder and decoder outputs are then fed to the video localization and segmentation model and the semantic captioning model, respectively, to obtain the video segments and a caption for each segment. The method was analyzed theoretically and validated experimentally on the ActivityNet Captions dataset, reaching an F1-score of 60.09 and a METEOR score of 8.78. The method effectively improves the accuracy of both video localization and segmentation and semantic captioning.
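The encoder step described in the abstract, computing how strongly the frames of one modality relate to the frames of another, can be illustrated with plain scaled dot-product attention. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function names `softmax` and `cross_modal_attention`, the toy feature values, and the omission of learned query/key/value projections are all choices made here for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention of one modality (queries) over another (keys/values).

    Hypothetical helper: a real model would first apply learned linear
    projections to queries, keys, and values; they are omitted here.
    """
    dim = len(keys[0])
    attended, weights = [], []
    for q in queries:
        # Similarity of this query frame to every frame of the other modality.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the other modality's frame features.
        attended.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return attended, weights


# Toy, made-up features (not from the paper): 2 visual frames, 3 audio frames, dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attended, weights = cross_modal_attention(visual, audio, audio)
```

Each row of `weights` sums to 1 and says how much one visual frame draws on each audio frame. The decoder step in the abstract has the same shape, with caption word features as the queries and the encoder's multi-modal frame features as keys and values.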
Keywords: dense video captioning; multi-modal video features; feature fusion; multi-modal attention mechanism
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]