Research on emotion recognition method based on audio and video feature fusion  (Cited by: 2)

Authors: TIE Yun, CHENG Huijie, JIN Cong[2], LI Xiaobing[3], QI Lin[1] (School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China; School of Information and Communication Engineering, Communication University of China, Beijing 100024, China; Central Conservatory of Music, Beijing 100031, China)

Affiliations: [1] School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China; [2] School of Information and Communication Engineering, Communication University of China, Beijing 100024, China; [3] Central Conservatory of Music, Beijing 100031, China

Source: Journal of Chongqing University of Technology (Natural Science), 2022, No. 1, pp. 120-127.

Funding: National Natural Science Foundation of China (61631016); National Key R&D Program of China (2018YFB1403900); Fundamental Research Funds for the Central Universities, Communication University of China (CUC200B017).

Abstract: Because user-generated videos contain highly diverse scenes and objects and express emotion only sparsely, video emotion recognition remains challenging. Traditional video emotion recognition work focuses mainly on facial expressions and human actions, ignoring both the abundant emotional cues contained in scenes and objects and the emotional relationships between different objects. This paper therefore proposes an audio-video feature fusion network based on visual relationship reasoning and cross-modal information learning to predict video emotion. The model consists of three parts: emotional relationship reasoning between objects, acoustic feature extraction, and cross-modal interaction fusion. First, a Mask R-CNN model extracts the regions containing objects together with their corresponding feature sequences, and a graph attention network reasons about the emotional relationships between different regions of a video frame to locate the key regions. Then, a bidirectional long short-term memory network (Bi-LSTM) extracts frame-level context information from log-mel spectrogram segments to complement the visual features. Finally, a multi-head attention mechanism is applied in the cross-modal interaction fusion module to learn hidden associations between the modalities, and the audio and visual features obtained through cross-modal attention are fused by a gated fusion network. The proposed model achieves better accuracy than several benchmark methods on the VideoEmotion-8 and Ekman datasets.
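To make the three-part pipeline concrete, the following is a minimal PyTorch sketch of the architecture the abstract describes: graph attention over Mask R-CNN region features, a Bi-LSTM over log-mel segments, cross-modal multi-head attention in both directions, and a gated fusion of the two modality vectors. All module names, dimensions, the single-head graph attention layer, and the mean-pooling and gating details are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the audio-video fusion pipeline (assumed details, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Single-head graph attention over the region features of one video frame.

    Scores every ordered pair of regions, normalizes the scores with softmax,
    and aggregates neighbor features: a simple stand-in for the paper's
    'emotional relationship reasoning between objects' step.
    """
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, in_dim)
        h = self.proj(regions)                          # (B, N, D)
        B, N, D = h.shape
        hi = h.unsqueeze(2).expand(B, N, N, D)          # region i repeated
        hj = h.unsqueeze(1).expand(B, N, N, D)          # region j repeated
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        alpha = F.softmax(e, dim=-1)                    # attention over neighbors
        return F.elu(torch.bmm(alpha, h))               # (B, N, D)


class AudioVisualEmotionNet(nn.Module):
    def __init__(self, region_dim=1024, mel_dim=64, d_model=256,
                 num_heads=4, num_classes=8):
        super().__init__()
        self.gat = GraphAttentionLayer(region_dim, d_model)
        # Bi-LSTM over log-mel frames; hidden size halved so the
        # concatenated forward/backward outputs match d_model.
        self.audio_lstm = nn.LSTM(mel_dim, d_model // 2, batch_first=True,
                                  bidirectional=True)
        # Cross-modal multi-head attention in both directions:
        # video queries attend to audio, audio queries attend to video.
        self.v2a = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, regions, mel):
        # regions: (B, num_regions, region_dim); mel: (B, frames, mel_dim)
        v = self.gat(regions)                 # relation-aware region features
        a, _ = self.audio_lstm(mel)           # frame-level acoustic context
        v_att, _ = self.v2a(v, a, a)          # video attends to audio
        a_att, _ = self.a2v(a, v, v)          # audio attends to video
        v_vec = v_att.mean(dim=1)             # pool over regions
        a_vec = a_att.mean(dim=1)             # pool over audio frames
        # Gated fusion: a learned sigmoid gate weights the two modalities
        # per feature dimension instead of simple concatenation.
        z = torch.sigmoid(self.gate(torch.cat([v_vec, a_vec], dim=-1)))
        fused = z * v_vec + (1 - z) * a_vec
        return self.classifier(fused)


if __name__ == "__main__":
    model = AudioVisualEmotionNet()
    regions = torch.randn(2, 10, 1024)   # e.g. 10 Mask R-CNN regions per clip
    mel = torch.randn(2, 100, 64)        # e.g. 100 log-mel frames per clip
    print(model(regions, mel).shape)     # torch.Size([2, 8])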

Keywords: emotion recognition; emotional relationship reasoning; cross-modal interaction; graph convolutional neural network; multi-head attention mechanism

CLC number: TP391 [Automation and Computer Technology / Computer Application Technology]

 
