基于有序记忆网络的视频描述  

Video Caption Based on Ordered Memory Network


Authors: HU Yikang; YANG Li [2]; CHEN Shuqin; WU Shifeng [3,4] (Computer and Information Engineering College, Hubei Normal University, Huangshi 435002, China; College of Computer, Hubei University of Education, Wuhan 430205, China; Hubei Key Laboratory of Transportation Internet of Things (Wuhan University of Technology), Wuhan 430064, China; ZhongQianLiYuan Engineering Consulting Co., Ltd., Wuhan 430071, China)

Affiliations: [1] Computer and Information Engineering College, Hubei Normal University, Huangshi 435002, China; [2] College of Computer, Hubei University of Education, Wuhan 430205, China; [3] Hubei Key Laboratory of Transportation Internet of Things (Wuhan University of Technology), Wuhan 430064, China; [4] ZhongQianLiYuan Engineering Consulting Co., Ltd., Wuhan 430071, China

Source: Software Guide (《软件导刊》), 2025, No. 4, pp. 154-163 (10 pages)

Funding: Key Project of Hubei Provincial Education Science Planning (Optics Valley Teacher Education Comprehensive Reform Experimental Zone Special Project) (2022ZA41); Hubei Provincial Natural Science Foundation (2023AFB206); Scientific Research Startup Fund for Talent Introduction of Hubei University of Education (ESRC20230009); Open Fund of Hubei Key Laboratory of Transportation Internet of Things (WHUTIOT2023-006)

Abstract: Current video captioning models based on Long Short-Term Memory (LSTM) networks tend to ignore the logical dependencies between earlier and later parts of the generated text, and the word-level cross-entropy loss optimized during training matches poorly with sentence-level evaluation metrics. To address these problems, this paper proposes an encoder-decoder model that combines a Bidirectional Long Short-Term Memory network (BiLSTM) with an ordered memory network (ONLSTM). The BiLSTM encodes the input video features, and an attention mechanism amplifies the influence of important features, so that information and dependencies between distant video frames are effectively recorded and retained. The ONLSTM serves as the decoder, exploiting its ability to learn sentence syntactic structure without supervision: by updating high-level and low-level neurons over different intervals, it learns hierarchical features and generates video captions that are more accurate and grammatically well formed. Training and testing on the MSR-VTT benchmark dataset show that introducing ordered neurons retains and learns all key information without loss of prediction accuracy.
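The key mechanism the abstract describes, the ordered-neuron update that lets the ONLSTM decoder partition its state into slowly updated high-level units and frequently updated low-level units, can be sketched as follows. This is a minimal NumPy illustration of the general ON-LSTM cell formulation, not the authors' implementation; the weight packing (`W`, `b`) and the dimensions in the usage example are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cumax(x, axis=-1):
    """Cumulative softmax: a soft, monotonically nondecreasing gate in [0, 1]."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    s = e / e.sum(axis=axis, keepdims=True)
    return np.cumsum(s, axis=axis)

def onlstm_step(x, h_prev, c_prev, W, b):
    """One ON-LSTM cell step.

    W has shape (input_dim + hidden, 6 * hidden); the six chunks are the
    standard forget/input/output/candidate gates plus the two master gates.
    """
    z = np.concatenate([x, h_prev], axis=-1) @ W + b
    f, i, o, g, mf, mi = np.split(z, 6, axis=-1)
    f, i, o, g = sigmoid(f), sigmoid(i), sigmoid(o), np.tanh(g)
    # Master gates impose the neuron ordering: f_master rises from 0 toward 1,
    # i_master falls from 1 toward 0, so high-ranked neurons keep old content
    # (high-level, slowly updated) while low-ranked neurons take new input.
    f_master = cumax(mf)
    i_master = 1.0 - cumax(mi)
    omega = f_master * i_master          # overlap where both gates are active
    f_hat = f * omega + (f_master - omega)
    i_hat = i * omega + (i_master - omega)
    c = f_hat * c_prev + i_hat * g       # interval-wise cell update
    h = o * np.tanh(c)
    return h, c

# Hypothetical usage: batch of 2, feature dim 8, hidden dim 16.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
h0, c0 = np.zeros((2, 16)), np.zeros((2, 16))
W = 0.1 * rng.standard_normal((8 + 16, 6 * 16))
b = np.zeros(6 * 16)
h1, c1 = onlstm_step(x, h0, c0, W, b)
```

Because `cumax` is nondecreasing along the neuron axis, each update effectively picks a soft split point: neurons above it are mostly copied, neurons below it are mostly rewritten, which is what lets the decoder track hierarchical (syntax-like) structure without supervision.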

Keywords: video captioning; ordered memory network; bidirectional long short-term memory network; attention mechanism; deep learning

Classification: TP391.41 [Automation and Computer Technology / Computer Application Technology]

 
