Video Content Description Algorithm Based on S-YOLO V5 and Vision Transformer (Cited by: 1)


Authors: XU Peng [1,3]; LI Tie-zhu [2]; ZHI Bao-ping [1]

Affiliations: [1] Department of Information Engineering, Yellow River Conservancy Technical Institute, Kaifeng 475004, China; [2] School of Computer and Information Engineering, Henan University, Kaifeng 475004, China; [3] Kaifeng Virtual Reality Application Engineering Technology Research Center, Kaifeng 475004, China

Source: Printing and Digital Media Technology Study, 2023, No. 4, pp. 212-222 (11 pages)

Funding: Youth Program of the National Natural Science Foundation of China, "Identification of Vibration Transmission Paths in Hydropower Unit-Powerhouse Structures Based on Prototype Observation" (No. 51709125); Henan Province Science and Technology Research Project, "Seismic Reliability of Aqueducts under Probability-Interval Hybrid Uncertainty: A Case Study of the Sha River" (No. 212102310479).

Abstract: Automatic generation of video content descriptions is a new cross-disciplinary task that combines computer vision and natural language processing. To address the poor readability of descriptions produced by current video captioning models, this study proposes a video content description algorithm based on S-YOLO V5 and the Vision Transformer (ViT). First, key frames are extracted with the neural network model KATNA so that the model can be trained on a minimal number of frames. Second, the S-YOLO V5 model extracts semantic information from the video frames, while a pretrained ResNet101 model and a pretrained C3D model extract static and dynamic visual features, respectively; the features of the two modalities are then fused. Next, exploiting the strong long-range encoding capability of the ViT architecture, an encoder is built to model long-range dependencies in the fused features. Finally, the encoder output is fed to an LSTM decoder, which emits predicted words one by one to produce the final natural language description. On the MSR-VTT dataset, the proposed model achieves BLEU-4, METEOR, ROUGE-L, and CIDEr scores of 42.9, 28.8, 62.4, and 51.4, respectively; on the MSVD dataset, it achieves 56.8, 37.6, 74.5, and 98.5. Compared with current mainstream models, the proposed model performs well on multiple evaluation metrics.
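The first stage of the pipeline relies on KATNA for key-frame selection. As a rough illustration only, the open-source Katna Python library exposes such an extractor; the sketch below assumes Katna and its disk writer are installed, and the video path, output directory, and frame budget are placeholder values, not settings from the paper.

# Minimal key-frame extraction sketch using the open-source Katna library.
# The video path, output directory, and frame budget are illustrative
# placeholders, not values taken from the paper.
from Katna.video import Video
from Katna.writer import KeyFrameDiskWriter

def extract_keyframes(video_path, n_frames=12, out_dir="keyframes"):
    vd = Video()
    writer = KeyFrameDiskWriter(location=out_dir)  # saves selected frames to disk
    vd.extract_video_keyframes(no_of_frames=n_frames,
                               file_path=video_path,
                               writer=writer)

if __name__ == "__main__":
    extract_keyframes("video.mp4")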
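The remaining stages (fusing static ResNet101 and dynamic C3D features, encoding the fusion with a multi-head-attention ViT-style encoder, and decoding a caption with an LSTM) can be sketched in PyTorch. This is a minimal sketch under assumed settings (2048-d ResNet101 pooled features, 4096-d C3D fc features, and illustrative model width, vocabulary size, and layer counts), not the authors' implementation; the S-YOLO V5 semantic branch is omitted for brevity.

# Hedged sketch of the described pipeline: fuse static (ResNet101) and
# dynamic (C3D) features, encode them with a multi-head-attention
# (ViT-style) Transformer encoder, then decode a caption with an LSTM.
# All dimensions and hyperparameters are illustrative assumptions, not the
# paper's actual settings.
import torch
import torch.nn as nn

class FusionEncoderLSTMDecoder(nn.Module):
    def __init__(self, static_dim=2048, dynamic_dim=4096, d_model=512,
                 n_heads=8, n_layers=4, vocab_size=10000, hidden=512):
        super().__init__()
        # Project both modalities into a shared space and fuse by concatenation.
        self.static_proj = nn.Linear(static_dim, d_model)
        self.dynamic_proj = nn.Linear(dynamic_dim, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)
        # ViT-style encoder: stacked multi-head self-attention blocks.
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # LSTM decoder over word embeddings, conditioned on the encoded video.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.LSTM(input_size=2 * d_model, hidden_size=hidden,
                               batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, static_feats, dynamic_feats, captions):
        # static_feats: (B, T, 2048); dynamic_feats: (B, T, 4096);
        # captions: (B, L) token ids for teacher forcing.
        fused = self.fuse(torch.cat([self.static_proj(static_feats),
                                     self.dynamic_proj(dynamic_feats)], dim=-1))
        memory = self.encoder(fused)               # (B, T, d_model)
        ctx = memory.mean(dim=1)                   # simple pooled video context
        emb = self.embed(captions)                 # (B, L, d_model)
        ctx_rep = ctx.unsqueeze(1).expand(-1, emb.size(1), -1)
        dec_in = torch.cat([emb, ctx_rep], dim=-1)
        hidden_states, _ = self.decoder(dec_in)
        return self.out(hidden_states)             # (B, L, vocab_size)

if __name__ == "__main__":
    model = FusionEncoderLSTMDecoder()
    logits = model(torch.randn(2, 12, 2048),       # per-keyframe ResNet101 feats
                   torch.randn(2, 12, 4096),       # per-clip C3D feats
                   torch.randint(0, 10000, (2, 15)))
    print(logits.shape)  # torch.Size([2, 15, 10000])

At inference time the LSTM would instead be unrolled step by step from a start token, feeding each predicted word back in, as the abstract's "sequentially generated predicted words" implies.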

Keywords: video content description; S-YOLO V5; Vision Transformer; multi-head attention

Classification Code: TP391 [Automation and Computer Technology - Computer Application Technology]

 
