An autoencoder-based feature level fusion for speech emotion recognition  


Authors: Peng Shixin, Chen Kai, Tian Tian, Chen Jingying

Affiliation: [1] National Engineering Research Center for E-Learning, National Engineering Laboratory for Educational Big Data, Central China Normal University, Hubei, 430079, China

Source: Digital Communications and Networks, 2024, No. 5, pp. 1341-1351 (11 pages)

Funding: Funded in part by the MOE (Ministry of Education in China) Project of Humanities and Social Sciences (No. 19YJC880068); the Hubei Provincial Natural Science Foundation of China (No. 2019CFB347); the China Postdoctoral Science Foundation (No. 2018M632889, No. 2022T150250); the National Natural Science Foundation of China (No. 61977027); the Hubei Province Technological Innovation Major Project (No. 2019AAA044); the Science & Technology Major Project of Hubei Province Next-Generation AI Technologies (No. 2021BEA159); the Research Funds of CCNU from the Colleges' Basic Research and Operation of MOE (No. 30106220491); and in part by the Key Program of the National Natural Science Foundation of China (No. 61937001).

Abstract: Although speech emotion recognition is challenging, it has broad application prospects in human-computer interaction. Building a system that can accurately and stably recognize emotions from human language can provide a better user experience. However, current unimodal emotion feature representations are not distinctive enough for accurate recognition, and they do not effectively model the inter-modality dynamics in speech emotion recognition tasks. This paper proposes a multimodal method that utilizes both audio and semantic content for speech emotion recognition. The proposed method consists of three parts: two high-level feature extractors for the text and audio modalities, and an autoencoder-based feature fusion. For the audio modality, we propose a structure called the Temporal Global Feature Extractor (TGFE) to extract high-level features of the time-frequency domain relationship from the original speech signal. Considering that text lacks frequency information, we use only a Bidirectional Long Short-Term Memory network (BLSTM) with an attention mechanism to model the intra-modality dynamics. Once these steps have been accomplished, the high-level text and audio features are sent to the autoencoder in parallel to learn their shared representation for final emotion classification. We conducted extensive experiments on three public benchmark datasets to evaluate our method. The results on the Interactive Emotional Motion Capture (IEMOCAP) and Multimodal EmotionLines Dataset (MELD) benchmarks outperform existing methods, and the results on the CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) dataset are competitive. Furthermore, experimental results show that, compared to unimodal information, the joint multimodal information (audio and text) improves overall performance, and the autoencoder-based feature-level fusion achieves greater accuracy than simple feature concatenation.
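The abstract describes the architecture only at a high level. The sketch below is a minimal, hypothetical PyTorch rendering of that pipeline, assuming a BLSTM-plus-attention branch for text, a simple temporal recurrent branch standing in for the paper's TGFE (whose exact design is not given here), and an autoencoder that fuses the concatenated high-level features into a shared latent representation for classification. All layer names and sizes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the bimodal fusion pipeline described in the abstract.
# Layer sizes, branch designs, and the training objective are assumptions.
import torch
import torch.nn as nn


class TextBranch(nn.Module):
    """BLSTM over word embeddings followed by additive attention pooling."""

    def __init__(self, embed_dim=300, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, x):                        # x: (batch, seq_len, embed_dim)
        h, _ = self.blstm(x)                     # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over time steps
        return (w * h).sum(dim=1)                # (batch, 2*hidden)


class AudioBranch(nn.Module):
    """Placeholder temporal extractor over frame-level acoustic features
    (e.g. log-Mel frames); the paper's TGFE is more elaborate."""

    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)
        return h.mean(dim=1)                     # temporal average pooling, (batch, 2*hidden)


class FusionAutoencoder(nn.Module):
    """Concatenates the two modality vectors, encodes them into a shared latent
    representation, reconstructs the input, and classifies from the latent code."""

    def __init__(self, in_dim=512, latent=128, n_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        self.classifier = nn.Linear(latent, n_classes)

    def forward(self, text_feat, audio_feat):
        fused = torch.cat([text_feat, audio_feat], dim=-1)   # (batch, in_dim)
        z = self.encoder(fused)                              # shared representation
        recon = self.decoder(z)
        logits = self.classifier(z)
        return logits, recon, fused


# Usage sketch with random tensors standing in for real utterances.
text_branch, audio_branch = TextBranch(), AudioBranch()
fusion = FusionAutoencoder(in_dim=512, latent=128, n_classes=4)
text = torch.randn(8, 30, 300)     # 8 utterances, 30 tokens, 300-d embeddings
audio = torch.randn(8, 200, 40)    # 8 utterances, 200 frames, 40-d features
logits, recon, fused = fusion(text_branch(text), audio_branch(audio))
labels = torch.randint(0, 4, (8,))
loss = nn.functional.cross_entropy(logits, labels) + nn.functional.mse_loss(recon, fused)
loss.backward()
```

The usage lines combine a cross-entropy classification loss with a reconstruction term, one common way to train such a fusion autoencoder end to end so the latent code retains information from both modalities; the paper's actual training objective may differ.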

Keywords: Attention mechanism; Autoencoder; Bimodal fusion; Emotion recognition

Classification code: TN9 [Electronics and Telecommunications / Information and Communication Engineering]

 
