Video Expression Recognition Based on Enhanced Features and an Attention Mechanism (Cited by: 1)

Video Facial Expression Recognition Based on ECNN-SA


Authors: LI Fei [1]; CHEN Rui [2]; TONG Ying [2]; CHEN Le [3] (1. School of Electric Power Engineering, Nanjing Institute of Technology, Nanjing 211167, China; 2. School of Information and Communication Engineering, Nanjing Institute of Technology, Nanjing 211167, China; 3. School of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China)

Affiliations: [1] School of Electric Power Engineering, Nanjing Institute of Technology, Nanjing 211167, Jiangsu, China; [2] School of Information and Communication Engineering, Nanjing Institute of Technology, Nanjing 211167, Jiangsu, China; [3] School of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China

Published in: Computer Technology and Development, 2022, No. 11, pp. 183-189 (7 pages)

Funding: National Natural Science Foundation of China Youth Program (61703201, 61701221); Natural Science Foundation of Jiangsu Province Youth Program (BK20170765); Jiangsu Future Network Scientific Research Fund Project (FNSRFP2021YB26); Jiangsu Postgraduate Research and Innovation Program (SJCX21_0945)

Abstract: The end-to-end CNN-LSTM model uses a convolutional neural network (CNN) to extract the spatial features of each image and a long short-term memory (LSTM) network to extract the temporal features between video frames, and it has been widely used in video expression recognition. However, when learning the hierarchical representation of video frames, the CNN-LSTM model has high complexity and is prone to overfitting. To address these problems, an efficient, low-complexity video expression recognition model named ECNN-SA (Enhanced Convolutional Neural Network with Self-Attention) is proposed. First, a video is divided into several segments, and the feature vector of each frame in a segment is extracted by a CNN with an enhanced-feature branch and a global average pooling layer. Second, the self-attention mechanism is used to obtain the correlations between feature vectors, and a weight vector is constructed from these correlations; the low-complexity self-attention module focuses on the key frames of expression change within a segment, which are most relevant to expression classification, guiding the classifier toward more accurate results. Experimental results on the CK+ and AFEW datasets show that the self-attention module makes the model focus mainly on the key frames of expression change in the time series; compared with single-layer and multi-layer LSTM networks, the ECNN-SA model classifies and recognizes the emotional information of video sequences more effectively.
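The abstract describes weighting per-frame CNN features by their pairwise correlations via self-attention, then pooling the segment for classification. The paper's exact equations are not given in this record; the following is only a rough illustrative sketch in NumPy, in which the projection matrices `Wq`/`Wk`, the row-averaging step that produces the per-frame weight vector, and all shapes are assumptions, not the authors' architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_pool(F, Wq, Wk):
    """F: (T, d) matrix of per-frame feature vectors
    (CNN backbone + global average pooling, one row per frame).
    Returns a weighted segment descriptor and the frame weights."""
    Q, K = F @ Wq, F @ Wk
    # (T, T) frame-to-frame correlation matrix, scaled dot-product style
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)
    # assumed aggregation: average attention received per frame -> weight vector
    w = A.mean(axis=0)          # sums to 1, emphasizes key frames
    return w @ F, w             # (d,) segment descriptor for the classifier

rng = np.random.default_rng(0)
T, d = 8, 16                    # 8 frames per segment, 16-dim features (toy sizes)
F = rng.standard_normal((T, d))
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
segment_feat, w = self_attention_pool(F, Wq, Wk)
```

The segment descriptor would then feed a softmax classifier; frames whose features correlate strongly with the rest of the segment receive larger weights, which matches the abstract's claim of attending to expression-change key frames.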

Keywords: facial expression recognition; video sequence; self-attention mechanism; enhanced features; convolutional neural network

CLC Number: TP391 (Automation and Computer Technology: Computer Application Technology)
