Video description with subject, verb and object supervision  

Authors: Wang Yue, Liu Jinlai, Wang Xiaojie

Affiliation: [1] School of Computer Science, Beijing University of Posts and Telecommunications

Source: The Journal of China Universities of Posts and Telecommunications, 2019, No. 2, pp. 52-58 (7 pages)

Funding: Supported by the National Natural Science Foundation of China (61273365) and the 111 Project (B08004)

Abstract: Video description aims to generate descriptive natural language for videos. Inspired by the deep neural networks (DNN) used in machine translation, the video description (VD) task applies a convolutional neural network (CNN) to extract video features and a long short-term memory (LSTM) network to generate descriptions. However, some models generate incorrect words and syntax. A likely reason is that previous models rely solely on the LSTM to generate sentences, which learns insufficient linguistic information. To solve this problem, an end-to-end DNN model incorporating subject, verb and object (SVO) supervision is proposed. Experimental results on a publicly available dataset, YouTube2Text, indicate that the model achieves a 58.4% consensus-based image description evaluation (CIDEr) score. It outperforms the mean pool and video description with first feed (VD-FF) models, demonstrating the effectiveness of SVO supervision.
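The abstract describes a CNN-encoder / LSTM-decoder captioner augmented with SVO supervision. The minimal PyTorch sketch below illustrates one way such a model could be wired up; the class name SVOCaptioner, the label-vocabulary sizes, the mean-pooling of frame features, and the use of the video vector as the LSTM initial state are assumptions for illustration only, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's code): CNN features feed an LSTM caption
# decoder, while auxiliary heads predict subject, verb and object labels.
import torch
import torch.nn as nn

class SVOCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000,
                 num_subjects=300, num_verbs=200, num_objects=300):
        super().__init__()
        # Project per-frame CNN features (e.g. from a pretrained image CNN)
        self.encoder = nn.Linear(feat_dim, hidden_dim)
        # LSTM decoder generates the caption word by word
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)
        # Auxiliary SVO heads supervise the video representation directly
        self.subj_head = nn.Linear(hidden_dim, num_subjects)
        self.verb_head = nn.Linear(hidden_dim, num_verbs)
        self.obj_head = nn.Linear(hidden_dim, num_objects)
        self.embed = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim); mean-pool over frames
        video_vec = self.encoder(frame_feats).mean(dim=1)   # (batch, hidden)
        # SVO predictions from the pooled video representation
        svo_logits = (self.subj_head(video_vec),
                      self.verb_head(video_vec),
                      self.obj_head(video_vec))
        # Condition the decoder on the video via its initial hidden state
        h0 = video_vec.unsqueeze(0)                          # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        out, _ = self.decoder(self.embed(captions), (h0, c0))
        word_logits = self.word_head(out)                    # (batch, T, vocab)
        return word_logits, svo_logits
```

During training, the captioning cross-entropy would be combined with cross-entropy losses on the three SVO heads; the loss weighting and the SVO label vocabularies used in the paper are not reproduced here.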

Keywords: VD, DNN, CNN, LSTM

Classification: TN [Electronics and Telecommunications]

 
