Research on Audiovisual Multimodal Emotion Recognition Based on Deep Learning  (Cited by: 3)


Authors: LI Qianqian; WANG Weixing[1]; YANG Qin[1]; CHEN Zhijiu; QIN Qing (School of Mechanical Engineering, Guizhou University, Guiyang 550025)

Affiliation: [1] School of Mechanical Engineering, Guizhou University, Guiyang 550025

Source: Computer & Digital Engineering, 2023, No. 3, pp. 695-699 (5 pages)

Funding: Guizhou Provincial Science and Technology Foundation project "Research on Recognition of Typical Movements in Original-Ecology Folk Dance Based on Depth Images" (No. 黔科合基础[2020]1Y262); Guizhou Provincial Department of Education Young Scientific and Technological Talents Growth Project "Research on Emotion-Oriented Soundtrack Technology for Digital Animation Products Based on Linguistic-Value Computation" (No. 黔教合KY字[2018]112); Guizhou University Talent Introduction Project "Research on Semantics-Driven Music and Image Emotion Recognition Technology" (No. 贵大人基合字(2018)16号).

Abstract: Emotions usually change gradually within the same context, yet most current research on audiovisual emotion recognition focuses on fusing static facial image features with speech features, ignoring both the temporal relationship between video frames and the role of body posture. This paper therefore combines a convolutional neural network (VGG) and a long short-term memory network (LSTM) to construct an audiovisual multimodal emotion recognition model based on deep neural networks, integrating expression, posture, and speech features. First, VGG is used to extract visual features from face images and posture images; LSTM is then used to extract temporal features from the face and posture image sequences, while openSMILE extracts audio features. Finally, the extracted face, posture, and audio features are concatenated and fused by a DNN, which also performs emotion classification. Experiments show that, compared with methods that fuse static facial image features and speech features for audiovisual emotion recognition, the proposed model achieves a better recognition rate, and adding posture features further improves the accuracy by 6.1%.
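The pipeline described in the abstract (VGG frame-level features, LSTM temporal modeling of the face and posture sequences, openSMILE audio features, and a DNN that concatenates the three streams and classifies) can be outlined in code. The following is a minimal PyTorch sketch, not the authors' implementation: layer sizes, sequence length, the audio feature dimension, and the number of emotion classes are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of a VGG + LSTM + DNN
# fusion model for face sequences, posture sequences, and audio features.
import torch
import torch.nn as nn
from torchvision import models


class AVEmotionModel(nn.Module):
    def __init__(self, num_classes=7, audio_dim=384, hidden_dim=256):
        super().__init__()
        # VGG-16 backbone as a frame-level visual feature extractor; shared
        # between face and posture streams here only to keep the sketch short.
        vgg = models.vgg16(weights=None)
        self.visual_backbone = nn.Sequential(
            vgg.features, nn.AdaptiveAvgPool2d((7, 7)), nn.Flatten(),
            nn.Linear(512 * 7 * 7, 512), nn.ReLU(),
        )
        # One LSTM per visual stream to model temporal dependencies
        # across the image sequence.
        self.face_lstm = nn.LSTM(512, hidden_dim, batch_first=True)
        self.pose_lstm = nn.LSTM(512, hidden_dim, batch_first=True)
        # DNN head: concatenate face, posture, and audio features
        # (audio_dim assumes an openSMILE functional vector) and classify.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2 + audio_dim, 256), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(256, num_classes),
        )

    def _encode_sequence(self, frames, lstm):
        # frames: (batch, time, 3, H, W) -> per-frame CNN features -> LSTM
        b, t = frames.shape[:2]
        feats = self.visual_backbone(frames.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = lstm(feats)
        return h_n[-1]  # last hidden state summarizes the sequence

    def forward(self, face_seq, pose_seq, audio_feat):
        face_vec = self._encode_sequence(face_seq, self.face_lstm)
        pose_vec = self._encode_sequence(pose_seq, self.pose_lstm)
        fused = torch.cat([face_vec, pose_vec, audio_feat], dim=1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = AVEmotionModel()
    faces = torch.randn(2, 16, 3, 224, 224)   # 16-frame face sequence
    poses = torch.randn(2, 16, 3, 224, 224)   # 16-frame posture sequence
    audio = torch.randn(2, 384)               # openSMILE-style audio vector
    print(model(faces, poses, audio).shape)   # -> torch.Size([2, 7])
```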

Keywords: deep learning; emotion recognition; visual features; temporal features; feature fusion

CLC Number: TP39 [Automation and Computer Technology - Computer Application Technology]

 
