Authors: ZHAO Jian (赵健); CUI Qian (崔骞); SHI Jia (石佳); LIU Yue (刘岳)
Affiliation: [1] School of Information Science and Technology, Northwest University, Xi'an 710127, Shaanxi, China
Source: Computer Engineering (《计算机工程》), 2024, No. 11, pp. 49-58 (10 pages)
Funding: Shaanxi Provincial International Science and Technology Cooperation Program (2021KWZ-07)
Abstract: In depression diagnosis, data such as facial expressions, voice signals, and written text from patients can serve as objective indicators for assessing depressive tendencies. Compared with video, the text and audio modalities better protect patient privacy when handling sensitive personal information, and both are language modalities with strong mutual correlation. To address the difficulty of analyzing variable-length text data and the limitations of manually extracted audio features in depression-tendency recognition, this study proposes a Transformer-based fusion network optimization method. For the text modality, a Convolutional Neural Network (CNN) extracts local features of the text at different scales, after which a Transformer model is introduced to capture global information and long-range dependencies. For the audio modality, a VGGish network automatically extracts audio features, reducing the impact of manual feature extraction on recognition results; the extracted features are then fed into a Transformer. Finally, to further strengthen the recognition performance of the text-audio fusion network, a Squeeze-and-Excitation (SE) channel attention mechanism is introduced, enabling the model to adaptively adjust the weight distribution between modalities and focus more effectively on key features. Experimental results show that the bimodal fusion network achieves an accuracy of 92.7%, an improvement of 2.9 and 4.9 percentage points over the text-only and audio-only modalities, respectively.
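The SE channel-attention fusion step described in the abstract can be sketched as follows. This is an illustrative NumPy sketch that treats the two modality feature vectors as channels of an SE block (squeeze by global average, excite with two fully connected layers and a sigmoid, then rescale); the function name, weight shapes, and the omission of a channel-reduction ratio are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def se_fuse(text_feat, audio_feat, w1, w2):
    """SE-style reweighting of two modality feature vectors.

    text_feat, audio_feat: (d,) feature vectors, stacked as C=2 channels.
    w1, w2: (2, 2) excitation weights (no reduction, since C is tiny).
    Returns the (2, d) stacked features scaled by learned channel weights.
    """
    x = np.stack([text_feat, audio_feat])   # (2, d): two modality "channels"
    s = x.mean(axis=1)                      # squeeze: per-channel global average
    e = sigmoid(relu(s @ w1) @ w2)          # excitation: FC -> ReLU -> FC -> sigmoid
    return x * e[:, None]                   # scale each modality by its weight
```

In a trained network `w1` and `w2` would be learned, letting the model adaptively up- or down-weight the text and audio streams before classification, which is the role the abstract assigns to the SE mechanism.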
Keywords: Transformer model; VGGish network; bimodal fusion; depression-tendency recognition; SE channel attention mechanism; deep learning
Classification: TP183 [Automation and Computer Technology — Control Theory and Control Engineering]