Granularity-Aware and Semantic Aggregation Based Image-Text Retrieval Network  (Cited by: 4)

Authors: MIAO Lan-xin; LEI Yu; ZENG Peng-peng; LI Xiao-yu [2]; SONG Jing-kuan

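Affiliations: [1] School of Computer Science and Engineering (School of Cyberspace Security), University of Electronic Science and Technology of China, Chengdu 611731, China; [2] School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China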

Source: Computer Science (《计算机科学》), 2022, No. 11, pp. 134-140 (7 pages)

Funding: National Natural Science Foundation of China (62122018, 61872064).

Abstract: Image-text retrieval is a fundamental task in the vision-language domain. It aims to mine the relationships between samples of different modalities, that is, to use a sample from one modality to retrieve semantically similar samples from the other. However, most existing approaches rely heavily on associating specific image regions with semantically similar words in a sentence and underestimate the importance of multi-granularity visual information, which leads to mismatches between the two modalities and semantically ambiguous embeddings. An image generally contains coarse- and fine-grained information at the object, action, relationship, and scene levels. This information carries no explicit multi-granularity labels, so it is difficult to align it one-to-one with ambiguous textual descriptions. To tackle this issue, the paper proposes a Granularity-Aware and Semantic Aggregation (GASA) network that obtains multi-granularity visual features and narrows the semantic gap between text and vision. Specifically, the granularity-aware feature selection module mines multi-granularity visual information and performs multi-scale fusion, guided by an adaptive gated fusion mechanism and a pyramid dilated-convolution structure. The semantic aggregation module clusters the multi-granularity information from visual and textual cues in a shared space to obtain residual representations. Experiments on two benchmark datasets show that the model outperforms the state of the art by more than 2% on R@1 on MSCOCO 1K and by 4.1% on R@Sum on Flickr30K.
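
The abstract reports results as R@1 and R@Sum. As a minimal sketch of how these metrics are commonly computed in image-text retrieval (not the authors' evaluation code), the Python below assumes a precomputed image-by-caption similarity matrix. The function names `recall_at_k` and `r_sum`, the cutoffs K = 1, 5, 10, and the toy data are illustrative assumptions; the record itself does not define them.

```python
import numpy as np

def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Percentage of queries with at least one correct candidate in the top K.

    sim[i, j] is the similarity of query i to candidate j (assumed given);
    gt[i] is the set of correct candidate indices for query i.
    """
    order = np.argsort(-sim, axis=1)  # candidates by descending similarity
    recalls = {}
    for k in ks:
        hits = [bool(set(order[i, :k].tolist()) & set(gt[i]))
                for i in range(sim.shape[0])]
        recalls[k] = 100.0 * float(np.mean(hits))
    return recalls

def r_sum(sim, i2t_gt, t2i_gt):
    """Sum of R@1, R@5, R@10 over both retrieval directions."""
    i2t = recall_at_k(sim, i2t_gt)    # image -> text
    t2i = recall_at_k(sim.T, t2i_gt)  # text -> image
    return sum(i2t.values()) + sum(t2i.values())

# Toy data (hypothetical): 2 images, 4 captions, 2 captions per image.
sim = np.array([[0.9, 0.1, 0.2, 0.3],
                [0.8, 0.2, 0.7, 0.1]])
i2t_gt = [{0, 1}, {2, 3}]    # correct captions for each image
t2i_gt = [{0}, {0}, {1}, {1}]  # correct image for each caption

i2t_recalls = recall_at_k(sim, i2t_gt)
total = r_sum(sim, i2t_gt, t2i_gt)
```

Under this convention R@Sum ranges from 0 to 600, so a gain in R@Sum aggregates gains across six individual recall figures rather than measuring a single one.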

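Keywords: image-text matching; cross-modal retrieval; feature extraction; semantic clustering; multi-granularity information extraction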

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
