Authors: 薛珮芸 (XUE Peiyun), 戴书涛 (DAI Shutao), 白静 (BAI Jing), 高翔 (GAO Xiang)
Affiliations: [1] College of Electronic Information and Optical Engineering, Taiyuan University of Technology, Taiyuan 030024, China; [2] Postdoctoral Workstation, Shanxi Advanced Innovation Research Institute, Taiyuan 030032, China
Source: Journal of Electronics & Information Technology (《电子与信息学报》), 2024, Issue 12, pp. 4542-4552 (11 pages)
Funding: Shanxi Provincial Youth Fund (20210302124544); Shanxi Provincial Applied Basic Research Program (201901D111094)
Abstract: To improve the accuracy of emotion recognition models and address insufficient emotional feature extraction, this paper studies bimodal emotion recognition using speech and facial images. For the speech modality, a feature extraction model based on a Multi-branch Convolutional Neural Network (MCNN) with a channel-spatial attention mechanism is proposed, which extracts emotional features from speech spectrograms across the temporal, spatial, and local feature dimensions. For the facial image modality, a feature extraction model based on a Residual Hybrid Convolutional Neural Network (RHCNN) is proposed, which further establishes a parallel attention mechanism to focus on global emotional features and improve recognition accuracy. The extracted speech and facial image features are classified through separate classification layers, and decision fusion is used to combine the two classification results into a final prediction. Experimental results show that the proposed bimodal fusion model achieves recognition accuracies of 97.22%, 94.78%, and 96.96% on the RAVDESS, eNTERFACE’05, and RML datasets, respectively: improvements of 11.02%, 4.24%, and 8.83% over speech-only recognition, and of 4.60%, 6.74%, and 4.10% over facial-image-only recognition. The model also outperforms related methods reported on these datasets in recent years, indicating that the proposed bimodal fusion model effectively focuses on emotional information and thereby improves the accuracy of emotion recognition.
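The abstract names two mechanisms that benefit from a concrete picture: channel-spatial attention over spectrogram feature maps and decision-level fusion of the two modalities' classifier outputs. The sketch below (PyTorch) is an illustrative assumption of how such components are commonly built, using CBAM-style gating and weighted probability averaging; it is not the paper's implementation, and all class names, hyperparameters, and the fusion weight are hypothetical.

```python
# Hypothetical sketch, NOT the paper's code: a CBAM-style channel-spatial
# attention block plus decision-level fusion of per-modality probabilities.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Channel attention followed by spatial attention on a feature map."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel branch: global average pool, bottleneck MLP, sigmoid gate.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial branch: 7x7 conv over stacked avg/max maps, sigmoid gate.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)              # reweight channels
        avg_map = x.mean(dim=1, keepdim=True)     # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)     # (B, 1, H, W)
        gate = self.spatial_gate(torch.cat([avg_map, max_map], dim=1))
        return x * gate                           # reweight spatial positions

def decision_fuse(p_audio: torch.Tensor, p_face: torch.Tensor,
                  w_audio: float = 0.5) -> torch.Tensor:
    """Weighted average of the two modalities' class probabilities;
    the final label is the argmax of the fused distribution."""
    return w_audio * p_audio + (1.0 - w_audio) * p_face

# Usage sketch: attention over a spectrogram feature map, then fusion.
feats = torch.randn(4, 64, 32, 32)                # batch of feature maps
attended = ChannelSpatialAttention(64)(feats)
p_a = torch.softmax(torch.randn(4, 7), dim=1)     # 7 emotion classes, assumed
p_f = torch.softmax(torch.randn(4, 7), dim=1)
labels = decision_fuse(p_a, p_f).argmax(dim=1)
```

The fusion weight `w_audio` is an assumed free parameter; the abstract only states that decision fusion combines the two classifiers' results, not how the combination is weighted.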
Keywords: emotion recognition; attention mechanism; multi-branch convolution; residual hybrid; decision fusion
Classification Codes: TN911.7 (Electronics and Telecommunications: Communication and Information Systems); TP391.41 (Electronics and Telecommunications: Information and Communication Engineering)