Visual question answering model of vision and scene text based on multi-modal reasoning graph neural network

Authors: Zhang Haitao (张海涛) [1]; Guo Xinyu (郭欣雨) (School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China)

Affiliation: [1] School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China

Source: Application Research of Computers (《计算机应用研究》), 2022, Issue 1, pp. 280-284, 302 (6 pages)

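Funding: General Program of the Natural Science Foundation of Liaoning Province; Equipment Pre-Research Fund of the General Armaments Department of the Chinese People's Liberation Army.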

Abstract: Poor text reading ability and insufficient visual reasoning are the main reasons existing visual question answering (VQA) models perform poorly. To address these problems, this paper designs a multi-modal reasoning graph neural network (MRGNN) model. The model uses multiple forms of information in an image to help understand scene text content: it preprocesses each scene text image into a visual object graph and a text graph, and filters redundant information with a question self-attention module. An attention-based aggregator then refines node features across the subgraphs, fusing information from the different modalities, so that the updated nodes carry context from the other modalities and better support the answering module. The model is validated on the ST-VQA and TextVQA datasets, and the experimental results show that MRGNN clearly improves on several other models for this task.
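The abstract does not say how the question-guided filtering or the attention aggregator is computed, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes dot-product relevance scores, top-k filtering against a question vector, scaled dot-product attention between the two subgraphs, and a residual update. The function names (`question_filter`, `attention_aggregate`) and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def question_filter(nodes, question, keep):
    """Assumed filtering step: keep the `keep` nodes whose features score
    highest against the question vector, preserving their original order."""
    ranked = sorted(range(len(nodes)),
                    key=lambda i: dot(nodes[i], question),
                    reverse=True)
    kept = sorted(ranked[:keep])
    return [nodes[i] for i in kept]


def attention_aggregate(targets, sources):
    """Assumed aggregator: each target node attends over all source nodes
    (scaled dot-product attention) and adds the weighted sum of source
    features to its own features (residual update)."""
    scale = math.sqrt(len(sources[0]))
    updated = []
    for t in targets:
        weights = softmax([dot(t, s) / scale for s in sources])
        context = [sum(w * s[d] for w, s in zip(weights, sources))
                   for d in range(len(t))]
        updated.append([a + c for a, c in zip(t, context)])
    return updated


# Toy 2-dimensional features standing in for real node embeddings.
visual_nodes = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_nodes = [[2.0, 0.0], [0.0, 2.0]]
question = [1.0, 0.0]

# Drop the visual node least related to the question, then let each
# subgraph refine its nodes with context from the other one.
visual_kept = question_filter(visual_nodes, question, keep=2)
text_updated = attention_aggregate(text_nodes, visual_kept)
visual_updated = attention_aggregate(visual_kept, text_nodes)
```

In this sketch the two graphs are updated symmetrically from each other's pre-update features. A real model would use learned projections and graph edges, which the abstract does not describe.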

Keywords: visual question answering; graph neural network; multi-modal reasoning; question self-attention

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
