Authors: ZHAO Jian (赵健); CUI Qian (崔骞); SHI Jia (石佳); LIU Yue (刘岳)
Affiliation: [1] School of Information Science and Technology, Northwest University, Xi'an 710127, Shaanxi, China
Source: Computer Engineering (《计算机工程》), 2024, No. 11, pp. 49-58 (10 pages)
Funding: Shaanxi Provincial International Science and Technology Cooperation Program (2021KWZ-07)
Abstract: In depression diagnosis, data such as facial expressions, voice signals, and written text from patients can serve as objective indicators for assessing depressive tendencies. Compared with video, the text and audio modalities better protect patient privacy when handling sensitive personal information, and both are language modalities with strong mutual correlation. To address the difficulty of analyzing variable-length text data and the limitations of manually extracted audio features in depression-tendency recognition, this study proposes a Transformer-based fusion network optimization method. For the text modality, a Convolutional Neural Network (CNN) extracts local features of the text at different scales, after which a Transformer model is introduced to capture global information and long-range dependencies. For the audio modality, a VGGish network automatically extracts audio features, reducing the impact of manual feature extraction on recognition results; the extracted features are then fed into a Transformer. Finally, to further strengthen the recognition performance of the text-audio fusion network, a Squeeze-and-Excitation (SE) channel attention mechanism is introduced, enabling the model to adaptively adjust the weight distribution between modalities and focus more effectively on key features. Experimental results show that the bimodal fusion network achieves an accuracy of 92.7%, an improvement of 2.9 and 4.9 percentage points over the text-only and audio-only modalities, respectively.
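The SE channel-attention fusion step described in the abstract can be sketched as follows. This is an illustrative NumPy sketch that treats the two modality feature vectors as channels of an SE block (squeeze by global average, excite with two fully connected layers and a sigmoid, then rescale); the function name, weight shapes, and the omission of a channel-reduction ratio are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def se_fuse(text_feat, audio_feat, w1, w2):
    """SE-style reweighting of two modality feature vectors.

    text_feat, audio_feat: (d,) feature vectors, stacked as C=2 channels.
    w1, w2: (2, 2) excitation weights (no reduction, since C is tiny).
    Returns the (2, d) stacked features scaled by learned channel weights.
    """
    x = np.stack([text_feat, audio_feat])   # (2, d): two modality "channels"
    s = x.mean(axis=1)                      # squeeze: per-channel global average
    e = sigmoid(relu(s @ w1) @ w2)          # excitation: FC -> ReLU -> FC -> sigmoid
    return x * e[:, None]                   # scale each modality by its weight
```

In a trained network `w1` and `w2` would be learned, letting the model adaptively up- or down-weight the text and audio streams before classification, which is the role the abstract assigns to the SE mechanism.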
Keywords: Transformer model; VGGish network; bimodal fusion; depression-tendency recognition; SE channel attention mechanism; deep learning
Classification: TP183 [Automation and Computer Technology — Control Theory and Control Engineering]