Authors: Yin Jingjing; Pan Lili [1]; Wang Chao [1]; Xiong Siyu; Qu Dongliang (School of Electronic Information and Physics, Central South University of Forestry and Technology, Changsha 410004, China)
Affiliation: [1] School of Electronic Information and Physics, Central South University of Forestry and Technology, Changsha 410004, China
Source: Journal of Nanjing University (Natural Science), 2024, No. 5, pp. 804-814 (11 pages)
Funding: Key Scientific Research Project of the Hunan Provincial Department of Education (22A0195); Teaching Reform Research Project of the Hunan Provincial Department of Education (HNJG-20230471)
Abstract: Image-text matching aims to achieve high-quality semantic alignment between images and texts, and is an important task at the intersection of computer vision and natural language processing. Images and texts are two distinct information carriers; differences in their information content and data distribution easily cause uncertainty and ambiguity in fine-grained cross-modal correlation. To address this problem, and based on the semantic consistency of image-text pairs, this paper proposes BSEM-Net (Bidirectional Semantic Embedding for Fine-Grained Image-Text Matching), which improves the accuracy of fine-grained image-text alignment through bidirectional semantic embedding, from image to text and from text to image. First, to reduce redundancy in image information, an Image Semantic Embedding module (IE) uses text words as supervisory signals to guide the model in suppressing the expression of irrelevant image regions. Second, to reduce the distribution gap between modalities and better establish fine-grained semantic alignment, a Text Semantic Embedding module (TE) uses image regions to select words and group them into phrases whose information distribution resembles that of the image regions. In addition, the two modules exploit a region relationship connectivity graph and a phrase relationship connectivity graph, respectively, to mine contextual information among intra-modal features and reduce semantic divergence. Comparative experiments against existing methods on the public cross-modal retrieval datasets Flickr30k and MSCOCO show that the proposed method achieves significant superiority on image-text matching tasks.
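The abstract's core idea of using text words as supervisory signals over image regions can be illustrated with a generic word-to-region cross-attention sketch. This is not the paper's actual implementation (BSEM-Net's architecture is not detailed here); the feature dimensions, temperature value, and scoring function below are illustrative assumptions only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_supervised_region_embedding(regions, words, temperature=4.0):
    """Weight image-region features by their relevance to text words.

    regions: (R, D) L2-normalized region features
    words:   (W, D) L2-normalized word features
    Returns (W, D): one word-grounded image feature per word, so regions
    irrelevant to every word receive little attention mass.
    """
    sim = words @ regions.T                    # (W, R) cosine similarities
    attn = softmax(temperature * sim, axis=1)  # each word attends over regions
    return attn @ regions                      # (W, D) attended image features

def match_score(regions, words):
    """Average cosine similarity between each word and its attended regions."""
    attended = word_supervised_region_embedding(regions, words)
    norm = np.linalg.norm(attended, axis=1, keepdims=True) + 1e-8
    return float(np.sum(words * (attended / norm), axis=1).mean())
```

A higher temperature sharpens the attention toward the single best-matching region per word, which is one common way such alignment schemes trade off between global pooling and hard one-to-one matching.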
Keywords: image-text matching; cross-modal; semantic embedding; fine-grained information correlation; semantic alignment
Classification: TP391.41 [Automation and Computer Technology: Computer Application Technology]