Authors: ZHAO En-Yuan; SONG Ning; NIE Jie [1]; WANG Xin; ZHENG Cheng-Yu; WEI Zhi-Qiang [1,3]
Affiliations: [1] Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266100, China; [2] Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; [3] Qingdao Marine Science and Technology Center, Qingdao 266061, China
Source: Journal of Software, 2024, No. 5, pp. 2133-2149 (17 pages)
Funding: National Key Research and Development Program of China (2021YFF0704000); National Natural Science Foundation of China (62172376); NSFC Joint Fund for Regional Innovation and Development (U22A2068); Special Fund of the Central Government Guiding Local Science and Technology Development (YDZX2022028).
Abstract: Remote sensing visual question answering (RSVQA) aims to extract scientific knowledge from remote sensing images. In recent years, many methods have emerged to bridge the semantic gap between remote sensing visual information and natural language. However, most of these methods consider only the alignment and fusion of multimodal information, ignoring the deep mining of multi-scale features and their spatial position information in remote sensing image objects and lacking research on modeling and reasoning over scale features, which results in incomplete and inaccurate answer prediction. To address these issues, this study proposes a multi-scale guided fusion inference network (MGFIN) that aims to enhance the visual-spatial reasoning ability of RSVQA systems. First, a multi-scale visual representation module based on Swin Transformer is designed to encode multi-scale visual features embedded with spatial position information. Second, guided by language cues, a multi-scale relation reasoning module learns higher-order intra-group object relations across scales, taking scale space as a cue, and performs spatial hierarchical reasoning. Finally, an inference-based fusion module is designed to bridge the multimodal semantic gap: on top of cross-attention, training objectives such as self-supervised paradigms, contrastive learning, and image-text matching are used to adaptively align and fuse multimodal features and to assist in predicting the final answer. Experimental results show that the proposed model has significant advantages on two public RSVQA datasets.
Keywords: remote sensing visual question answering; multimodal intelligent fusion; multimodal reasoning; multi-scale representation
Classification: TP18 (Automation and Computer Technology: Control Theory and Control Engineering)
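The fusion step summarized in the abstract (question tokens attending over visual tokens drawn from several scales via cross-attention) can be sketched in miniature as below. This is an illustrative numpy sketch with hypothetical token counts and dimensions, not the authors' MGFIN implementation, which additionally uses Swin Transformer features, relation reasoning, and contrastive/matching training objectives.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, keys, values):
    """Scaled dot-product cross-attention: each query token attends
    over all key/value tokens and returns a fused representation."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)   # (n_query, n_kv) similarities
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ values                # (n_query, d) fused features

rng = np.random.default_rng(0)
d = 16
# Hypothetical multi-scale visual tokens: three pyramid levels with
# 16, 4, and 1 tokens, all projected to a shared dimension d.
visual_tokens = np.concatenate(
    [rng.normal(size=(n, d)) for n in (16, 4, 1)], axis=0)  # (21, d)
# Hypothetical question embedding: 5 text tokens of dimension d.
text_query = rng.normal(size=(5, d))

fused = cross_attention(text_query, visual_tokens, visual_tokens)
print(fused.shape)  # (5, 16): one fused vector per question token
```

Concatenating tokens from all scales before attention lets the question weight coarse and fine visual evidence jointly, which is the intuition behind scale-guided fusion.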