融合语义增强和位置编码的图文匹配方法

Image-Text Matching Method Combining Semantic Enhancement and Position Encoding

作　　者：赵婷婷[1] 常玉广郭宇陈亚瑞[1] 王嫄 ZHAO Tingting;CHANG Yuguang;GUO Yu;CHEN Yarui;WANG Yuan(College of Artificial Intelligence,Tianjin University of Science and Technology,Tianjin 300457,China)

机构地区：[1]天津科技大学人工智能学院,天津300457

出　　处：《天津科技大学学报》2024年第4期63-72,共10页Journal of Tianjin University of Science & Technology

基　　金：国家自然科学基金项目(61976156);天津市企业科技特派员项目(20YDTPJC00560)。

摘　　要：图文匹配是跨模态基础任务之一,其核心是如何准确评估图像语义与文本语义之间的相似度。现有方法是通过引入相关阈值,最大限度地区分相关和无关分布,以获得更好的语义对齐。然而,对于特征本身,其语义之间缺乏相互关联,且对于缺乏空间位置信息的图像区域与文本单词很难准确对齐,从而不可避免地限制了相关阈值的学习导致语义无法准确对齐。针对此问题,本文提出一种融合语义增强和位置编码的自适应相关性可学习注意力的图文匹配方法。首先,在初步提取特征的基础上构造图像(文本)无向全连通图,使用图注意力去聚合邻居的信息,获得语义增强的特征。然后,对图像区域的绝对位置信息编码,在具备了空间语义的图像区域与文本单词相似性的基础上获得最大程度区分的相关和无关分布,更好地学习两个分布之间的最优相关边界。最后,通过公开数据集Flickr 30 k和MSCOCO,利用Recall@K指标对比实验,验证本文方法的有效性。Image-text matching is one of the basic cross-modal tasks.Its core is how to accurately evaluate the similarity between image semantics and text semantics.Existing methods maximize the distinction between relevant and irrelevant distributions by introducing a correlation threshold to obtain better semantic alignment.However,for the features themselves,there is a lack of correlation between their semantics,and it is difficult to accurately align image areas and text words that lack spatial location information,which inevitably limits the learning of relevant thresholds and results in the inability to accurately align semantics.To address this problem,in this article we propose an image-text matching method that combines semantic enhancement and positional coding with adaptive correlation learnable attention.Specifically,an undirected fully connected graph of images(texts)is first constructed based on preliminary feature extraction,and graph attention is used to aggregate neighbor information to obtain semantically enhanced features.Then,the absolute position information of the image area is encoded,and the most differentiated relevant and irrelevant distributions are obtained based on the similarity between the image area and the text words with spatial semantics,so as to better learn the optimal correlation between the two distributions.boundary.Finally,through the public datasets Flickr 30 k and MS-COCO,the effectiveness of the method proposed in this article was verified with the use of the Recall@K indicator comparison experiment.

关键词：跨模态图文匹配图注意力位置编码相关性阈值

分类号：TP391.4[自动化与计算机技术—计算机应用技术]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

融合语义增强和位置编码的图文匹配方法

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

融合语义增强和位置编码的图文匹配方法

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索