Authors: Yin Jingjing; Pan Lili [1]; Wang Chao [1]; Xiong Siyu; Qu Dongliang (School of Electronic Information and Physics, Central South University of Forestry and Technology, Changsha 410004, China)
Affiliation: [1] School of Electronic Information and Physics, Central South University of Forestry and Technology, Changsha 410004, China
Source: Journal of Nanjing University (Natural Science), 2024, No. 5, pp. 804-814 (11 pages)
Funding: Key Scientific Research Project of the Hunan Provincial Department of Education (22A0195); Teaching Reform Research Project of the Hunan Provincial Department of Education (HNJG-20230471)
Abstract: Image-text matching aims to achieve high-quality semantic alignment between images and texts, and is an important task at the intersection of computer vision and natural language processing. Images and texts are two distinct information carriers; differences in their information content and data distribution easily cause uncertainty and ambiguity in fine-grained cross-modal correlation. To address this problem, and based on the semantic consistency of image-text pairs, this paper proposes BSEM-Net (Bidirectional Semantic Embedding for Fine-Grained Image-Text Matching), which improves the accuracy of fine-grained image-text alignment through bidirectional semantic embedding, from image to text and from text to image. First, to reduce redundancy in image information, an Image Semantic Embedding module (IE) uses text words as supervisory signals to guide the model in suppressing the expression of irrelevant image regions. Second, to reduce the distribution gap between modalities and better establish fine-grained semantic alignment, a Text Semantic Embedding module (TE) uses image regions to select words and group them into phrases whose information distribution resembles that of the image regions. In addition, the two modules exploit a region relationship connectivity graph and a phrase relationship connectivity graph, respectively, to mine contextual information among intra-modal features and reduce semantic divergence. Comparative experiments against existing methods on the public cross-modal retrieval datasets Flickr30k and MSCOCO show that the proposed method achieves significant superiority on image-text matching tasks.
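The abstract's core idea of using text words as supervisory signals over image regions can be illustrated with a generic word-to-region cross-attention sketch. This is not the paper's actual implementation (BSEM-Net's architecture is not detailed here); the feature dimensions, temperature value, and scoring function below are illustrative assumptions only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_supervised_region_embedding(regions, words, temperature=4.0):
    """Weight image-region features by their relevance to text words.

    regions: (R, D) L2-normalized region features
    words:   (W, D) L2-normalized word features
    Returns (W, D): one word-grounded image feature per word, so regions
    irrelevant to every word receive little attention mass.
    """
    sim = words @ regions.T                    # (W, R) cosine similarities
    attn = softmax(temperature * sim, axis=1)  # each word attends over regions
    return attn @ regions                      # (W, D) attended image features

def match_score(regions, words):
    """Average cosine similarity between each word and its attended regions."""
    attended = word_supervised_region_embedding(regions, words)
    norm = np.linalg.norm(attended, axis=1, keepdims=True) + 1e-8
    return float(np.sum(words * (attended / norm), axis=1).mean())
```

A higher temperature sharpens the attention toward the single best-matching region per word, which is one common way such alignment schemes trade off between global pooling and hard one-to-one matching.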
Keywords: image-text matching; cross-modal; semantic embedding; fine-grained information correlation; semantic alignment
Classification: TP391.41 [Automation and Computer Technology: Computer Application Technology]