Authors: ZHOU Xiaoyan, WANG Lili, SHAO Yongbin, JU Xing
Affiliation: School of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, Jiangsu, China
Source: Technical Acoustics (《声学技术》), 2024, No. 6, pp. 854-861 (8 pages)
Abstract: To address the difficulty of extracting discriminative emotional features in speech emotion recognition, a two-channel feature-fusion speech representation method is proposed that combines a convolutional neural network with a vision transformer. A convolutional module channel built on an inverted bottleneck structure, trained with a transformer-like strategy, extracts local spectral features. An improved vision transformer extracts global sequence features: the convolutional neural network processes the whole spectrogram directly in place of the patch-splitting stage, which better captures temporal information. The features extracted by the two channels are fused to obtain highly discriminative emotional features, which are finally fed into a Softmax classifier to produce the recognition result. In experiments on the EMO-DB and CASIA databases, the proposed model achieves average accuracies of 94.24% and 93.05%, respectively, outperforming the compared models and demonstrating the effectiveness of the method.
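The sketch below illustrates the dual-channel idea the abstract describes, in PyTorch. It is a minimal reconstruction under stated assumptions, not the paper's implementation: the module names (InvertedBottleneck, DualChannelSER), all layer sizes and hyperparameters, the ConvNeXt-style depthwise/expand/project block, the frame-sequence transformer standing in for the modified vision transformer, and fusion by concatenation are all illustrative choices not taken from the source.

# A minimal sketch of a two-channel CNN + transformer model with feature
# fusion for speech emotion recognition. All names, sizes, and hyperparameters
# are illustrative assumptions; the paper's exact architecture is not given here.
import torch
import torch.nn as nn


class InvertedBottleneck(nn.Module):
    """Inverted bottleneck block (assumed ConvNeXt-style): depthwise conv,
    channel expansion, then projection back, with a residual connection."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # depthwise
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, dim * expansion, kernel_size=1),  # expand channels
            nn.GELU(),
            nn.Conv2d(dim * expansion, dim, kernel_size=1),  # project back
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # residual


class DualChannelSER(nn.Module):
    """Conv channel for local spectral features, transformer channel for
    global sequence features; fused features go to a linear classifier."""

    def __init__(self, n_mels: int = 64, d_model: int = 128, n_classes: int = 7):
        super().__init__()
        # Channel 1: convolutional stem + inverted-bottleneck blocks.
        self.conv_channel = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=4, stride=4),  # downsampling stem
            InvertedBottleneck(d_model),
            InvertedBottleneck(d_model),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),  # -> (batch, d_model)
        )
        # Channel 2: project each spectrogram frame to d_model, then run a
        # small transformer encoder over the frame sequence (assumed stand-in
        # for the paper's improved vision transformer over the whole spectrogram).
        self.frame_proj = nn.Linear(n_mels, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Fusion by concatenation; Softmax is applied implicitly by the
        # cross-entropy loss in training, explicitly at inference.
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, n_mels, time) log-Mel spectrogram
        local_feat = self.conv_channel(spec.unsqueeze(1))     # (batch, d_model)
        seq = self.frame_proj(spec.transpose(1, 2))           # (batch, time, d_model)
        global_feat = self.transformer(seq).mean(dim=1)       # (batch, d_model)
        fused = torch.cat([local_feat, global_feat], dim=-1)  # feature fusion
        return self.classifier(fused)                         # class logits


if __name__ == "__main__":
    model = DualChannelSER()
    dummy = torch.randn(2, 64, 300)  # batch of 2 spectrograms, 300 frames each
    probs = torch.softmax(model(dummy), dim=-1)
    print(probs.shape)  # torch.Size([2, 7])

Concatenation followed by a linear layer is the simplest fusion scheme consistent with the abstract's description; attention-based or weighted fusion would slot in at the same point.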
Keywords: speech emotion recognition; convolutional neural network; vision transformer; feature fusion
CLC Number: TN912.34 [Electronics & Telecommunications—Communication and Information Systems]