A Multi-Level Circulant Cross-Modal Transformer for Multimodal Speech Emotion Recognition (Cited by: 1)


Authors: Peizhu Gong, Jin Liu, Zhongdai Wu, Bing Han, Y. Ken Wang, Huihua He

Affiliations: [1] College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China; [2] Shanghai Ship and Shipping Research Institute, Shanghai 200135, China; [3] Division of Management and Education, University of Pittsburgh, Bradford, USA; [4] College of Early Childhood Education, Shanghai Normal University, Shanghai 200234, China

Source: Computers, Materials & Continua, 2023, Issue 2, pp. 4203-4220 (18 pages)

Funding: the National Natural Science Foundation of China (No. 61872231); the National Key Research and Development Program of China (No. 2021YFC2801000); the Major Research Plan of the National Social Science Foundation of China (No. 2000&ZD130).

Abstract: Speech emotion recognition, as an important component of human-computer interaction technology, has received increasing attention. Recent studies have treated emotion recognition of speech signals as a multimodal task, since the signal carries semantic features from two different modalities, i.e., audio and text. However, existing methods often fail to represent features effectively and to capture cross-modal correlations. This paper presents a multi-level circulant cross-modal Transformer (MLCCT) for multimodal speech emotion recognition. The proposed model proceeds in three steps: feature extraction, interaction, and fusion. Self-supervised embedding models are introduced for feature extraction, giving a more powerful representation of the original data than spectrograms or audio features such as Mel-frequency cepstral coefficients (MFCCs) and low-level descriptors (LLDs). In particular, MLCCT contains two types of feature interaction processes: a bidirectional Long Short-Term Memory (Bi-LSTM) network with a circulant interaction mechanism is proposed for low-level features, while a two-stream residual cross-modal Transformer block is applied when high-level features are involved. Finally, self-attention blocks are chosen for fusion and a fully connected layer makes the predictions. To evaluate the performance of the proposed model, comprehensive experiments are conducted on three widely used benchmark datasets: IEMOCAP, MELD and CMU-MOSEI. The competitive results verify the effectiveness of our approach.
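The abstract does not include implementation details, but the two-stream residual cross-modal Transformer block it describes can be illustrated with a minimal sketch. The class name, feature dimensions, and the use of PyTorch's nn.MultiheadAttention below are assumptions made for illustration only, not the authors' implementation: each modality's features act as queries against the other modality, with residual connections and a position-wise feed-forward layer per stream.

import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Sketch of a two-stream residual cross-modal Transformer block (illustrative only)."""
    def __init__(self, dim=256, heads=4, ff_mult=4, dropout=0.1):
        super().__init__()
        # Audio stream: audio features are queries; text features are keys/values.
        self.attn_a = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        # Text stream: text features are queries; audio features are keys/values.
        self.attn_t = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm_a1, self.norm_a2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm_t1, self.norm_t2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ff_a = nn.Sequential(nn.Linear(dim, ff_mult * dim), nn.GELU(),
                                  nn.Linear(ff_mult * dim, dim))
        self.ff_t = nn.Sequential(nn.Linear(dim, ff_mult * dim), nn.GELU(),
                                  nn.Linear(ff_mult * dim, dim))

    def forward(self, audio, text):
        # Cross-attention in both directions, each with a residual connection.
        a = audio + self.attn_a(self.norm_a1(audio), text, text)[0]
        t = text + self.attn_t(self.norm_t1(text), audio, audio)[0]
        # Position-wise feed-forward layers, again with residual connections.
        a = a + self.ff_a(self.norm_a2(a))
        t = t + self.ff_t(self.norm_t2(t))
        return a, t

# Example: 2 utterances, 50 audio frames and 30 text tokens, 256-dim features.
audio = torch.randn(2, 50, 256)
text = torch.randn(2, 30, 256)
a_out, t_out = CrossModalBlock()(audio, text)
print(a_out.shape, t_out.shape)  # torch.Size([2, 50, 256]) torch.Size([2, 30, 256])

The outputs of the two streams would then be fused (for example, by the self-attention blocks mentioned in the abstract) before the final fully connected classification layer; that fusion stage is omitted here.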

Keywords: speech emotion recognition; self-supervised embedding model; cross-modal Transformer; self-attention

Classification: TN912.34 [Electronics and Telecommunications - Communication and Information Systems]

 
