Affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006 [2] College of Mathematics, Physics and Information Engineering, Jiaxing University, Jiaxing, Zhejiang 314001 [3] School of Computer Science and Engineering, Changshu Institute of Technology, Changshu, Jiangsu 215500 [4] School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044 [5] Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun 130012
Source: Chinese Journal of Computers, 2018, No. 12, pp. 2852-2866 (15 pages)
Funding: National Natural Science Foundation of China (61773272, 61170124, 61272258, 61301299); the "Cloud-Data Fusion for Innovation in Science and Education" Fund of the Science and Technology Development Center of the Ministry of Education (2017B03112); Natural Science Foundation of Jiangsu Province (BK20151260, BK20151254); Natural Science Foundation of Zhejiang Province (LY15F020039); Jiangsu Province "Six Talent Peaks" Project (DZXX-027); Fund of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (93K172016K08); Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX17_2006).
Abstract: Event recognition in surveillance video has attracted growing interest in recent years. Nevertheless, event recognition in real-world surveillance video still faces great challenges, such as cluttered backgrounds, severe occlusion within the event bounding box, and large intra-class variations combined with small inter-class variations. A pronounced trend is that more research focuses on learning deep features from raw data. The two-stream CNN (Convolutional Neural Network) architecture, which exploits appearance features and short-term motion features, has become a very successful model in the video analysis field. In contrast, the Long Short-Term Memory (LSTM) network can learn long-term motion features from an input sequence and is widely used for tasks involving time series. To combine the advantages of the two types of networks, this paper proposes a deep residual dual unidirectional double LSTM (DRDU-DLSTM) for event recognition in surveillance video with complex scenes. First, deep features are extracted from the fine-tuned temporal CNN and spatial CNN. Since fully connected (FC) layers carry more semantic information than convolutional layers and are therefore better suited as inputs to the LSTM network, we extract the FC6 feature of the spatial CNN and the FC7 feature of the temporal CNN, respectively. Second, to reinforce spatio-temporal consistency, the deep features are transformed by a spatial LSTM (SLSTM) and a temporal LSTM (TLSTM), respectively, and concatenated into a unit called the double-LSTM (DLSTM), which forms the input of the residual network. DLSTM cells increase the number of hidden nodes of the LSTM cells and expand the width of the network. The input features of the spatial CNN and the temporal CNN are deeply intertwined by the DLSTM cells, and as they are transmitted and evolve simultaneously, the consistency of the spatial and temporal features increases. Furthermore, two unidirectional DLSTMs are concatenated to form a DU-DLSTM layer; several DU-DLSTM layers together with an identity mapping constitute a residual module, and stacking multiple residual modules yields the deep residual network architecture. To further refine the recognition results, a 2C-softmax objective function based on a dual-center loss is designed, which maximizes the inter-class distance while minimizing the intra-class distance. Experiments on the surveillance video datasets VIRAT 1.0 and VIRAT 2.0 show that the proposed method achieves good performance and stability, improving recognition accuracy by 5.1% and 7.3%, respectively.
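The abstract describes the DRDU-DLSTM data flow in prose only. The snippet below is a minimal sketch of that flow in PyTorch, written only to illustrate how the SLSTM/TLSTM outputs could be joined into a DLSTM unit and wrapped with an identity shortcut; the class names, feature dimensions, hidden sizes, and layer counts are illustrative assumptions and not taken from the paper, and the 2C-softmax objective is omitted.

```python
import torch
import torch.nn as nn


class DLSTMUnit(nn.Module):
    """Joins a spatial LSTM (SLSTM) and a temporal LSTM (TLSTM) into one DLSTM unit.

    The spatial stream consumes per-frame appearance features (FC6-like) and the
    temporal stream consumes motion features (FC7-like); both are unrolled over
    the same time steps so their hidden states stay temporally aligned.
    """

    def __init__(self, spatial_dim, temporal_dim, hidden_dim):
        super().__init__()
        self.slstm = nn.LSTM(spatial_dim, hidden_dim, batch_first=True)
        self.tlstm = nn.LSTM(temporal_dim, hidden_dim, batch_first=True)

    def forward(self, spatial_feats, temporal_feats):
        hs, _ = self.slstm(spatial_feats)    # (B, T, hidden_dim)
        ht, _ = self.tlstm(temporal_feats)   # (B, T, hidden_dim)
        return torch.cat([hs, ht], dim=-1)   # joint spatio-temporal sequence


class DUDLSTMResidualModule(nn.Module):
    """A residual module: stacked DU-DLSTM layers plus an identity shortcut.

    Each DU-DLSTM layer is approximated here by two unidirectional LSTMs whose
    outputs are concatenated back to the input width so the skip connection
    type-checks; the widths in the paper may differ.
    """

    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ModuleDict({
                "a": nn.LSTM(dim, dim // 2, batch_first=True),
                "b": nn.LSTM(dim, dim // 2, batch_first=True),
            })
            for _ in range(num_layers)
        )

    def forward(self, x):
        out = x
        for layer in self.layers:
            ha, _ = layer["a"](out)
            hb, _ = layer["b"](out)
            out = torch.cat([ha, hb], dim=-1)
        return out + x  # identity mapping around the stacked DU-DLSTM layers


if __name__ == "__main__":
    B, T = 4, 16                         # batch size and sequence length (illustrative)
    spatial = torch.randn(B, T, 4096)    # FC6-like appearance features
    temporal = torch.randn(B, T, 4096)   # FC7-like motion features

    dlstm = DLSTMUnit(4096, 4096, 256)
    block = DUDLSTMResidualModule(dim=512, num_layers=2)

    seq = dlstm(spatial, temporal)       # (4, 16, 512)
    print(block(seq).shape)              # torch.Size([4, 16, 512])
```

In a full model, several such residual modules would be stacked and the final hidden states fed to a classifier trained with the paper's 2C-softmax objective.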
Keywords: event recognition; spatio-temporal consistency; residual network; LSTM; dual unidirectional DLSTM; deep features; surveillance video
CLC Number: TP391 [Automation and Computer Technology / Computer Application Technology]