Talking Co-attention Networks for Visual Question Answering


Authors: YANG Xuhua [1]; PANG Yuchao; YE Lei [1] (College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China)

Affiliation: [1] College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2024, No. 8, pp. 1901-1907 (7 pages)

Funding: Supported by the National Natural Science Foundation of China (62176236).

Abstract: Visual question answering (VQA) is a cross-modal learning task: it analyzes information from two different modalities, image content and open questions expressed in natural language, and predicts an answer. Attention mechanisms are widely used to capture intra-modal and inter-modal relationships between visual images and text because of their strong performance in key-information extraction. However, traditional attention mechanisms tend to ignore the self-correlation information within images and texts, and they do not make good use of the information differences between the two modalities. In this paper, we propose Talking Co-attention Networks (T-CAN) for visual question answering to address these problems. First, we propose a talking multi-head attention mechanism that captures the hidden relationships between different attention heads to obtain enhanced attention information. We design different talking strategies to process the information exchanged between attention heads before and after the softmax normalization, which introduces prior information while reducing the risk of overfitting. We then propose a talking self-attention unit (T-SA) and a talking guided-attention unit (T-GA), and combine them efficiently in an encoder-decoder architecture to enrich the visual and textual representations. The framework adds positional encoding to the self-attention layers, compensating for the inability of talking self-attention to capture positional information. Different attention strategies are used to obtain the image and text vectors separately, and a new multi-modal fusion module combines the image and text information while reducing the dependence on any single modality. The model is compared with several well-known algorithms on the VQA-v2 dataset, and numerical simulation experiments show that the proposed algorithm has clear advantages.
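The core mechanism the abstract describes, mixing information across attention heads both before and after the softmax normalization, follows the talking-heads attention idea. Below is a minimal PyTorch sketch for illustration only: the class name TalkingHeadsAttention and all parameter names are our own placeholders, not code from the paper, and the paper's actual pre-/post-softmax talking strategies, T-SA/T-GA units, and positional encoding may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TalkingHeadsAttention(nn.Module):
    """Sketch of talking multi-head attention: attention logits are
    linearly mixed across the head axis before softmax, and the
    normalized weights are mixed again after softmax."""

    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # "Talking" projections: learned head-to-head mixing matrices.
        self.talk_pre = nn.Linear(num_heads, num_heads, bias=False)
        self.talk_post = nn.Linear(num_heads, num_heads, bias=False)

    def forward(self, x, context=None):
        # Self-attention when context is None (a T-SA-style unit);
        # guided attention when context comes from the other modality
        # (a T-GA-style unit).
        context = x if context is None else context
        b, nq, _ = x.shape
        nk = context.shape[1]
        q = self.q_proj(x).view(b, nq, self.h, self.d).transpose(1, 2)
        k = self.k_proj(context).view(b, nk, self.h, self.d).transpose(1, 2)
        v = self.v_proj(context).view(b, nk, self.h, self.d).transpose(1, 2)
        logits = q @ k.transpose(-2, -1) / self.d ** 0.5  # (b, h, nq, nk)
        # Pre-softmax talk: mix logits across heads (injects a learned prior).
        logits = self.talk_pre(logits.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        weights = F.softmax(logits, dim=-1)
        # Post-softmax talk: mix the normalized attention weights across heads.
        weights = self.talk_post(weights.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        out = (weights @ v).transpose(1, 2).reshape(b, nq, self.h * self.d)
        return self.out_proj(out)
```

Under these assumptions, a T-SA-style unit would call the module with question (or image) features alone, while a T-GA-style unit would pass, for example, image features as x and question features as context so that one modality guides attention over the other.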

Keywords: visual question answering; feature extraction; talking attention; multi-modal feature fusion

Classification: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]

 
