Improved Speech Emotion Recognition Focusing on High-Level Data Representations and Swift Feature Extraction Calculation  


Authors: Akmalbek Abdusalomov, Alpamis Kutlimuratov, Rashid Nasimov, Taeg Keun Whangbo

Affiliations: [1] Department of Computer Engineering, Gachon University, Sujeong-Gu, Seongnam-Si, Gyeonggi-Do, 13120, Korea [2] Department of AI Software, Gachon University, Seongnam-Si, 13120, Korea [3] Department of Artificial Intelligence, Tashkent State University of Economics, Tashkent, 100066, Uzbekistan

Source: Computers, Materials & Continua, 2023, No. 12, pp. 2915-2933 (19 pages)

Funding: Supported by the GRRC program of Gyeonggi Province (GRRC-Gachon2023(B02), Development of AI-based medical service technology).

Abstract: The performance of a speech emotion recognition (SER) system is heavily influenced by the efficacy of its feature extraction techniques. This study was designed to advance the field of SER by optimizing feature extraction, specifically through the incorporation of high-resolution Mel-spectrograms and the expedited calculation of Mel Frequency Cepstral Coefficients (MFCC). The initiative aimed to refine the system's accuracy by identifying and mitigating the shortcomings commonly found in current approaches. Ultimately, the primary objective was to elevate both the intricacy and effectiveness of our SER model, with a focus on augmenting its proficiency in the accurate identification of emotions in spoken language. The research employed a dual-strategy approach to feature extraction. First, a rapid computation technique for MFCC was implemented and integrated with a Bi-LSTM layer to optimize the encoding of MFCC features. Second, a pretrained ResNet model was utilized in conjunction with statistics pooling and dense layers for the effective encoding of Mel-spectrogram attributes. These two sets of features underwent separate processing before being combined in a Convolutional Neural Network (CNN) outfitted with a dense layer, with the aim of enhancing their representational richness. The model was rigorously evaluated on two prominent databases: CMU-MOSEI and RAVDESS. Notable findings include an accuracy of 93.2% on the CMU-MOSEI database and 95.3% on the RAVDESS database. Such performance underscores the efficacy of this approach, which not only meets but exceeds the accuracy benchmarks established by traditional models in speech emotion recognition.
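The abstract describes a fast MFCC computation step as the first branch of the pipeline. As an illustration only, a minimal NumPy/SciPy sketch of a standard vectorized MFCC pipeline (framing, power spectrum, triangular mel filterbank, log, DCT-II) might look as follows. The function name, frame sizes, and filterbank parameters here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Vectorized MFCC: all frames processed in one batched FFT."""
    # Frame the signal with a stride trick-free index matrix, apply a Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)

    # Power spectrum via a single batched real FFT over all frames
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel filterbank spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # Log mel energies, then DCT-II to decorrelate into cepstral coefficients
    log_mel = np.log(power @ fbank.T + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

In a model such as the one described, the resulting (frames x coefficients) matrix would then feed a Bi-LSTM encoder, while the second branch encodes Mel-spectrograms with a pretrained ResNet before CNN-based fusion.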

Keywords: feature extraction; MFCC; ResNet; speech emotion recognition

Classification: TP391.4 (Automation and Computer Technology: Computer Application Technology)
