Bimodal Emotion Recognition with Speech and Facial Images

Authors: XUE Peiyun, DAI Shutao, BAI Jing, GAO Xiang (College of Electronic Information and Optical Engineering, Taiyuan University of Technology, Taiyuan 030024, China; Postdoctoral Workstation, Shanxi Advanced Innovation Research Institute, Taiyuan 030032, China)

Affiliations: [1] College of Electronic Information and Optical Engineering, Taiyuan University of Technology, Taiyuan 030024, China; [2] Postdoctoral Workstation, Shanxi Advanced Innovation Research Institute, Taiyuan 030032, China

Source: Journal of Electronics & Information Technology, 2024, No. 12, pp. 4542-4552 (11 pages)

Funding: Shanxi Province Youth Fund (20210302124544); Shanxi Applied Basic Research Program (201901D111094).

Abstract: To improve the accuracy of emotion recognition models and address insufficient emotional feature extraction, this paper studies bimodal emotion recognition with speech and facial images. For the speech modality, a feature extraction model based on a Multi-branch Convolutional Neural Network (MCNN) with a channel-spatial attention mechanism is proposed, which extracts emotional features from speech spectrograms along the temporal, spatial, and local feature dimensions. For the facial image modality, a feature extraction model based on a Residual Hybrid Convolutional Neural Network (RHCNN) is proposed, and a parallel attention mechanism is further established to focus on global emotional features and improve recognition accuracy. The extracted speech and facial image features are classified through separate classification layers, and decision fusion is used to combine the classification results into a final prediction. Experimental results show that the proposed bimodal fusion model achieves recognition accuracies of 97.22%, 94.78%, and 96.96% on the RAVDESS, eNTERFACE'05, and RML datasets, respectively. These results improve on speech-only recognition by 11.02%, 4.24%, and 8.83%, and on facial-image-only recognition by 4.60%, 6.74%, and 4.10%, and they also exceed recent related methods on the corresponding datasets. This demonstrates that the proposed bimodal fusion model effectively focuses on emotional information and thereby improves emotion recognition accuracy.
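The abstract pairs multi-branch convolution with a channel-spatial attention mechanism on the speech branch. As a rough illustration of that attention component only, below is a minimal PyTorch sketch of a CBAM-style block that applies channel attention followed by spatial attention to a spectrogram feature map; the reduction ratio, the 7x7 spatial kernel, and the sequential layout are assumptions for illustration, not details taken from the paper's MCNN.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """CBAM-style channel + spatial attention over a (B, C, H, W) feature map.

    Hypothetical sketch: the paper's exact attention design is not reproduced here.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, re-weight each channel.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 7x7 conv over pooled per-pixel channel statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = x.mean(dim=(2, 3))                       # (B, C)
        mx = x.amax(dim=(2, 3))                        # (B, C)
        ch = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ch.view(b, c, 1, 1)
        # Spatial attention from per-pixel channel statistics.
        avg_map = x.mean(dim=1, keepdim=True)          # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)          # (B, 1, H, W)
        sp = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return x * sp
```

In a pipeline like the one described, such a block would sit after a convolutional branch and re-weight the spectrogram features before classification.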

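The two modality classifiers are combined at the decision level. As a minimal sketch of one common decision-fusion scheme, the function below takes a weighted average of the per-modality softmax probabilities; the weight w_speech is a hypothetical hyperparameter, since the paper's exact fusion rule is not reproduced here.

```python
import torch

def decision_fusion(speech_logits: torch.Tensor,
                    face_logits: torch.Tensor,
                    w_speech: float = 0.5) -> torch.Tensor:
    """Fuse per-modality classifier outputs at the decision level.

    Each classifier emits logits of shape (B, num_classes); the fused
    prediction is a weighted average of the per-modality probabilities.
    w_speech is an illustrative assumption, not a value from the paper.
    """
    p_speech = torch.softmax(speech_logits, dim=-1)
    p_face = torch.softmax(face_logits, dim=-1)
    fused = w_speech * p_speech + (1.0 - w_speech) * p_face
    return fused.argmax(dim=-1)  # predicted emotion class per sample
```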
Keywords: emotion recognition; attention mechanism; multi-branch convolution; residual hybrid; decision fusion

Classification codes: TN911.7 [Electronics and Telecommunications: Communication and Information Systems]; TP391.41 [Electronics and Telecommunications: Information and Communication Engineering]

 
