Authors: ZOU Pinrong; XIAO Feng [2]; ZHANG Wenjuan; HUANG Shujuan [2]; ZHANG Wanyu (School of Defence Science and Technology, Xi'an Technological University, Xi'an 710021, China; School of Computer Science and Engineering, Xi'an Technological University, Xi'an 710021, China; School of Sciences, Xi'an Technological University, Xi'an 710021, China)
Affiliations: [1] School of Defence Science and Technology, Xi'an Technological University, Xi'an 710021, China; [2] School of Computer Science and Engineering, Xi'an Technological University, Xi'an 710021, China; [3] School of Sciences, Xi'an Technological University, Xi'an 710021, China
Source: Journal of Xi'an Technological University, 2023, No. 1, pp. 56-65 (10 pages)
Funding: National Natural Science Foundation of China (62171361); Shaanxi Provincial Science and Technology Program (2020GY-066); Shaanxi Provincial Natural Science Basic Research Program (2021JM-440); Weiyang District Science and Technology Program (201925).
Abstract: To capture deeper relational semantics in question-answering scenarios and to improve network interpretability, this paper proposes a visual question answering (VQA) model, named Scenario Relationship Network (SRN), that explicitly fuses scene semantics with spatial relations, generating relation-aware graph representations from the relations among visual objects and their attributes. First, a graph network is constructed from the visual object relations and spatial position information detected in the image. Second, predefined scene semantic relations and spatial object relations are separately encoded by a question-adaptive graph attention mechanism to learn multimodal feature representations under prior knowledge. Finally, the two relation models are linearly fused to infer the answer. Experiments on the VQA 2.0 dataset show that, compared with the VQA models BUTD, DA-NTN, ODA-GCN, Scence GCN, VCTREE-HL and MuRel, the proposed model improves accuracy on the test-dev subset by 4.12%, 1.88%, 2.77%, 2.63%, 1.25% and 1.41%, respectively. The model can reason over visual semantic relations under question guidance and effectively improves the accuracy of visual question answering.
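The pipeline the abstract describes (question-adaptive graph attention over a semantic and a spatial relation graph, followed by linear fusion of the two branches) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, feature dimensions, additive pairwise scoring, and fixed fusion weight are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_graph_attention(obj_feats, question, adj):
    """One question-adaptive graph attention step (illustrative form).

    obj_feats: (N, d) detected visual object features
    question:  (d,)   question embedding
    adj:       (N, N) relation mask (1 where a semantic/spatial edge exists)
    """
    relevance = obj_feats @ question                    # (N,) question relevance per object
    pair = relevance[:, None] + relevance[None, :]      # (N, N) additive pairwise score (assumed form)
    pair = np.where(adj > 0, pair, -1e9)                # attend only along graph edges
    alpha = softmax(pair, axis=-1)                      # normalize over each object's neighbors
    return alpha @ obj_feats                            # (N, d) relation-aware object features

def fuse_branches(semantic_feats, spatial_feats, w=0.5):
    # Linear fusion of the semantic-relation and spatial-relation branches.
    return w * semantic_feats + (1 - w) * spatial_feats

# Toy usage with random features and fully connected relation graphs.
rng = np.random.default_rng(0)
objs = rng.normal(size=(4, 8))
q = rng.normal(size=8)
adj = np.ones((4, 4))
sem = question_guided_graph_attention(objs, q, adj)
spa = question_guided_graph_attention(objs, q, adj)
fused = fuse_branches(sem, spa)
```

In the paper the two branches use different predefined edge sets (scene semantic relations vs. spatial object relations); here both calls reuse one dense mask purely for brevity.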
Keywords: visual question answering; attention mechanism; semantic relation; spatial relation; relation encoding
Classification: TP391.41 [Automation and Computer Technology: Computer Application Technology]