Authors: ZHOU Hao [1], WANG Chao [1], CUI Guoheng [1], LUO Tingjin [2]
Affiliations: [1] Department of Operational Research and Planning, Naval University of Engineering, Wuhan 430033, China; [2] College of Science, National University of Defense Technology, Changsha 410073, China
Source: Journal of Computer Applications, 2025, Issue 3, pp. 739-745 (7 pages)
Funding: National Natural Science Foundation of China (62302516, 62376281); Natural Science Foundation of Hubei Province (2022CFC049); Huxiang Young Talents Program of Hunan Province (2021RC3070)
Abstract: Bridging the semantic gap between visual images and text questions is one of the important ways to improve the reasoning accuracy of Visual Question Answering (VQA) models. However, most existing models extract low-level image features and use attention mechanisms to reason out answers, ignoring the role that high-level image semantic features, such as relation and attribute features, play in visual reasoning. To address this problem, a VQA model based on multi-semantic association and fusion was proposed to establish semantic links between questions and images. Firstly, multiple kinds of semantics were extracted from images on the basis of a scene graph generation framework and, after feature refinement, used as the feature input of the VQA model, so as to fully exploit the information in image scenes. Secondly, to improve the semantic value of the image features, an information filter was designed to remove noise and redundant information from them. Finally, a multi-layer attention fusion and reasoning module was designed to fuse each kind of image semantics with the question features separately, strengthening the semantic association between key image regions and the text question. Comparative experiments with the BAN (Bilinear Attention Network) and CFR (Coarse-to-Fine Reasoning) models show that the proposed model improves accuracy on the VQA2.0 test set by 2.9 and 0.4 percentage points respectively, and on the GQA test set by 17.2 and 0.3 percentage points respectively, indicating that the proposed model better understands the semantics of image scenes and answers compositional visual questions.
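The fusion step described in the abstract, in which the question guides attention over several image semantic streams (objects, attributes, relations) before fusion, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, the dot-product attention, and the element-wise-product fusion operator are all assumptions chosen for simplicity.

```python
# Hypothetical sketch: question-guided attention over three image semantic
# streams (object, attribute, relation features), then a simple fusion.
# Shapes and the fusion operator are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(question, features):
    """Summarize feature rows, weighted by similarity to the question."""
    scores = features @ question          # (n_regions,) similarity scores
    weights = softmax(scores)             # attention distribution over rows
    return weights @ features             # (d,) attended summary vector

d = 8                                     # toy feature dimension
question = rng.standard_normal(d)         # encoded question vector
object_feats = rng.standard_normal((5, d))     # region-level object features
attribute_feats = rng.standard_normal((5, d))  # attribute features
relation_feats = rng.standard_normal((4, d))   # pairwise relation features

# Fuse each attended semantic stream with the question; element-wise
# product is one common fusion choice (the paper's operator may differ).
fused = sum(attend(question, f) * question
            for f in (object_feats, attribute_feats, relation_feats))
print(fused.shape)  # (8,)
```

Each stream contributes one attended summary, so adding or removing a semantic stream (e.g. dropping relation features) changes only the list passed to the fusion sum.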
Keywords: multi-semantic feature fusion; visual question answering; scene graph; attribute attention; relation attention
Classification: TP391 (Automation and Computer Technology: Computer Application Technology); TP18 (Automation and Computer Technology: Computer Science and Technology)