Authors: XU Peng[1,3]; LI Tie-zhu; ZHI Bao-ping[1] (Department of Information Engineering, Yellow River Conservancy Technical Institute, Kaifeng 475004, China; School of Computer and Information Engineering, Henan University, Kaifeng 475004, China; Virtual Reality Application Engineering Technology Research Center, Kaifeng 475004, China)
Affiliations: [1] School of Information Engineering, Yellow River Conservancy Technical Institute, Kaifeng 475004, China; [2] School of Computer and Information Engineering, Henan University, Kaifeng 475004, China; [3] Kaifeng Virtual Reality Application Engineering Technology Research Center, Kaifeng 475004, China
Source: Printing and Digital Media Technology Study, 2023, No. 4, pp. 212-222 (11 pages)
Funding: National Natural Science Foundation of China Youth Project, "Identification of vibration transmission paths between hydropower units and powerhouse structures based on prototype observation" (No. 51709125); Henan Province Science and Technology Research Project, "Seismic reliability of aqueducts under probability-interval hybrid uncertainty: the Sha River as a case study" (No. 212102310479).
Abstract: Automatic generation of video content descriptions is a novel cross-disciplinary task that combines computer vision and natural language processing techniques. To address the poor readability of descriptions produced by current video content description models, this study proposes a video content description algorithm based on S-YOLO V5 and Vision Transformer (ViT). First, key frames are extracted with the neural network model KATNA so that the model can be trained on a minimal number of frames. Second, the S-YOLO V5 model extracts semantic information from the video frames, while a pretrained ResNet101 model and a pretrained C3D model extract static and dynamic visual features, respectively; the features of the two modalities are then fused. Next, exploiting the strong long-range encoding capability of the ViT structure, an encoder is built to model long-range dependencies in the fused features. Finally, the encoder output is fed to an LSTM decoder, which emits predicted words in sequence to produce the final natural-language description. On the MSR-VTT dataset, the proposed model achieves BLEU-4, METEOR, ROUGE-L, and CIDEr scores of 42.9, 28.8, 62.4, and 51.4, respectively; on the MSVD dataset, it achieves 56.8, 37.6, 74.5, and 98.5. Compared with current mainstream models, the proposed model performs well on multiple evaluation metrics.
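The encoding step the abstract describes (fused static and dynamic frame features passed through a ViT-style, multi-head self-attention encoder) can be sketched as below. This is an illustrative numpy sketch, not the paper's implementation: the feature dimensions, the simple concatenation fusion, and the random stand-in weight matrices are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    """Encode a sequence of fused frame features with multi-head
    self-attention, the long-range mechanism the abstract attributes
    to the ViT-style encoder. Projection weights are random stand-ins
    for learned parameters (illustration only)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(4)
    )
    # Project and split into heads: (num_heads, seq_len, d_head).
    q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention: every frame attends to every frame,
    # regardless of temporal distance (the "long-range" property).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)            # (heads, seq, seq)
    out = (attn @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ w_o, attn

rng = np.random.default_rng(0)
frames = 8
static = rng.standard_normal((frames, 32))    # stand-in for ResNet101 features
dynamic = rng.standard_normal((frames, 32))   # stand-in for C3D features
fused = np.concatenate([static, dynamic], axis=1)  # concat fusion (assumption)
encoded, attn = multi_head_self_attention(fused, num_heads=4, rng=rng)
print(encoded.shape)                          # → (8, 64)
print(np.allclose(attn.sum(axis=-1), 1.0))    # → True (rows are distributions)
```

In the described pipeline, the encoded sequence would then be consumed by an LSTM decoder that emits one word per step; the hyperparameters here (8 frames, 64-dimensional features, 4 heads) are arbitrary.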
Keywords: video content description; S-YOLO V5; Vision Transformer; multi-head attention
CLC number: TP391 (Automation and Computer Technology: Computer Application Technology)