Authors: ZHAO En-Yuan; SONG Ning; NIE Jie [1]; WANG Xin; ZHENG Cheng-Yu; WEI Zhi-Qiang [1,3]
Affiliations: [1] Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266100, China; [2] Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; [3] Qingdao Marine Science and Technology Center, Qingdao 266061, China
Source: Journal of Software, 2024, No. 5, pp. 2133-2149 (17 pages)
Funding: National Key Research and Development Program of China (2021YFF0704000); National Natural Science Foundation of China (62172376); NSFC Joint Fund for Regional Innovation and Development (U22A2068); Special Fund of the Central Government Guiding Local Science and Technology Development (YDZX2022028).
Abstract: Remote sensing visual question answering (RSVQA) aims to extract scientific knowledge from remote sensing images. In recent years, many methods have emerged to bridge the semantic gap between remote sensing visual information and natural language. However, most of these methods consider only the alignment and fusion of multimodal information, ignoring the deep mining of multi-scale features and their spatial position information in remote sensing image objects and lacking research on modeling and reasoning over scale features, which results in incomplete and inaccurate answer prediction. To address these issues, this study proposes a multi-scale guided fusion inference network (MGFIN) that aims to enhance the visual-spatial reasoning ability of RSVQA systems. First, a multi-scale visual representation module based on Swin Transformer is designed to encode multi-scale visual features embedded with spatial position information. Second, guided by language cues, a multi-scale relation reasoning module learns higher-order intra-group object relations across scales, taking scale space as a cue, and performs spatial hierarchical reasoning. Finally, an inference-based fusion module is designed to bridge the multimodal semantic gap: on top of cross-attention, training objectives such as self-supervised paradigms, contrastive learning, and image-text matching are used to adaptively align and fuse multimodal features and to assist in predicting the final answer. Experimental results show that the proposed model has significant advantages on two public RSVQA datasets.
Keywords: remote sensing visual question answering; multimodal intelligent fusion; multimodal reasoning; multi-scale representation
Classification: TP18 (Automation and Computer Technology: Control Theory and Control Engineering)
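The fusion step summarized in the abstract (question tokens attending over visual tokens drawn from several scales via cross-attention) can be sketched in miniature as below. This is an illustrative numpy sketch with hypothetical token counts and dimensions, not the authors' MGFIN implementation, which additionally uses Swin Transformer features, relation reasoning, and contrastive/matching training objectives.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, keys, values):
    """Scaled dot-product cross-attention: each query token attends
    over all key/value tokens and returns a fused representation."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)   # (n_query, n_kv) similarities
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ values                # (n_query, d) fused features

rng = np.random.default_rng(0)
d = 16
# Hypothetical multi-scale visual tokens: three pyramid levels with
# 16, 4, and 1 tokens, all projected to a shared dimension d.
visual_tokens = np.concatenate(
    [rng.normal(size=(n, d)) for n in (16, 4, 1)], axis=0)  # (21, d)
# Hypothetical question embedding: 5 text tokens of dimension d.
text_query = rng.normal(size=(5, d))

fused = cross_attention(text_query, visual_tokens, visual_tokens)
print(fused.shape)  # (5, 16): one fused vector per question token
```

Concatenating tokens from all scales before attention lets the question weight coarse and fine visual evidence jointly, which is the intuition behind scale-guided fusion.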