Authors: ZOU Pinrong; XIAO Feng [2]; ZHANG Wenjuan; HUANG Shujuan [2]; ZHANG Wanyu (School of Defence Science and Technology, Xi'an Technological University, Xi'an 710021, China; School of Computer Science and Engineering, Xi'an Technological University, Xi'an 710021, China; School of Sciences, Xi'an Technological University, Xi'an 710021, China)
Affiliations: [1] School of Defence Science and Technology, Xi'an Technological University, Xi'an 710021, China; [2] School of Computer Science and Engineering, Xi'an Technological University, Xi'an 710021, China; [3] School of Sciences, Xi'an Technological University, Xi'an 710021, China
Source: Journal of Xi'an Technological University, 2023, No. 1, pp. 56-65 (10 pages)
Funding: National Natural Science Foundation of China (62171361); Shaanxi Provincial Science and Technology Program (2020GY-066); Shaanxi Provincial Natural Science Basic Research Program (2021JM-440); Weiyang District Science and Technology Program (201925).
Abstract: To capture deeper relational semantics in question-answering scenarios and to improve network interpretability, this paper proposes a visual question answering (VQA) model, named Scenario Relationship Network (SRN), that explicitly fuses scene semantics with spatial relations, generating relation-aware graph representations from the relations among visual objects and their attributes. First, a graph network is constructed from the visual object relations and spatial position information detected in the image. Second, predefined scene semantic relations and spatial object relations are separately encoded by a question-adaptive graph attention mechanism to learn multimodal feature representations under prior knowledge. Finally, the two relation models are linearly fused to infer the answer. Experiments on the VQA 2.0 dataset show that, compared with the VQA models BUTD, DA-NTN, ODA-GCN, Scence GCN, VCTREE-HL and MuRel, the proposed model improves accuracy on the test-dev subset by 4.12%, 1.88%, 2.77%, 2.63%, 1.25% and 1.41%, respectively. The model can reason over visual semantic relations under question guidance and effectively improves the accuracy of visual question answering.
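The pipeline the abstract describes (question-adaptive graph attention over a semantic and a spatial relation graph, followed by linear fusion of the two branches) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, feature dimensions, additive pairwise scoring, and fixed fusion weight are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_graph_attention(obj_feats, question, adj):
    """One question-adaptive graph attention step (illustrative form).

    obj_feats: (N, d) detected visual object features
    question:  (d,)   question embedding
    adj:       (N, N) relation mask (1 where a semantic/spatial edge exists)
    """
    relevance = obj_feats @ question                    # (N,) question relevance per object
    pair = relevance[:, None] + relevance[None, :]      # (N, N) additive pairwise score (assumed form)
    pair = np.where(adj > 0, pair, -1e9)                # attend only along graph edges
    alpha = softmax(pair, axis=-1)                      # normalize over each object's neighbors
    return alpha @ obj_feats                            # (N, d) relation-aware object features

def fuse_branches(semantic_feats, spatial_feats, w=0.5):
    # Linear fusion of the semantic-relation and spatial-relation branches.
    return w * semantic_feats + (1 - w) * spatial_feats

# Toy usage with random features and fully connected relation graphs.
rng = np.random.default_rng(0)
objs = rng.normal(size=(4, 8))
q = rng.normal(size=8)
adj = np.ones((4, 4))
sem = question_guided_graph_attention(objs, q, adj)
spa = question_guided_graph_attention(objs, q, adj)
fused = fuse_branches(sem, spa)
```

In the paper the two branches use different predefined edge sets (scene semantic relations vs. spatial object relations); here both calls reuse one dense mask purely for brevity.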
Keywords: visual question answering; attention mechanism; semantic relation; spatial relation; relation encoding
Classification: TP391.41 [Automation and Computer Technology: Computer Application Technology]