Authors: 赵婷婷 (ZHAO Tingting)[1], 常玉广 (CHANG Yuguang), 郭宇 (GUO Yu), 陈亚瑞 (CHEN Yarui)[1], 王嫄 (WANG Yuan). College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, China.
Source: 《天津科技大学学报》 (Journal of Tianjin University of Science & Technology), 2024, No. 4, pp. 63-72 (10 pages).
Funding: National Natural Science Foundation of China (61976156); Tianjin Enterprise Science and Technology Commissioner Project (20YDTPJC00560).
Abstract: Image-text matching is one of the basic cross-modal tasks. Its core problem is how to accurately evaluate the similarity between image semantics and text semantics. Existing methods introduce a correlation threshold that maximally separates the relevant and irrelevant distributions in order to obtain better semantic alignment. However, the features themselves lack semantic correlation with one another, and image regions that lack spatial location information are difficult to align accurately with text words. This inevitably limits the learning of the correlation threshold and prevents accurate semantic alignment. To address this problem, this paper proposes an image-text matching method based on adaptive correlation learnable attention that combines semantic enhancement and positional encoding. First, an undirected fully connected graph is constructed for the image (text) from the preliminarily extracted features, and graph attention is used to aggregate neighbor information and obtain semantically enhanced features. Then, the absolute position information of the image regions is encoded, and maximally separated relevant and irrelevant distributions are obtained from the similarity between the spatially aware image regions and the text words, so that the optimal correlation boundary between the two distributions can be learned better. Finally, comparison experiments on the public datasets Flickr30K and MS-COCO, using the Recall@K metric, verify the effectiveness of the proposed method.
Classification number: TP391.4 [Automation and Computer Technology: Computer Application Technology]
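The abstract reports results with the Recall@K metric. As a rough illustration only, the sketch below computes Recall@K for a toy retrieval setting. It is not the authors' evaluation code: the function name `recall_at_k`, the similarity values, and the simplification that each query has exactly one ground-truth match are all assumptions made for this example (the actual benchmarks pair each image with several captions).

```python
def recall_at_k(sim, gt, k):
    """Fraction of queries whose ground-truth item is among the top-k scores.

    sim: list of score rows, one row per query, one score per candidate.
    gt:  list of ground-truth candidate indices, one per query
         (assumption: a single correct match per query).
    """
    hits = 0
    for scores, target in zip(sim, gt):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy similarity matrix: 3 image queries x 3 candidate captions (made-up values).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
gt = [0, 1, 2]

print(recall_at_k(sim, gt, 1))
print(recall_at_k(sim, gt, 2))
```

In this toy case the second query ranks its correct caption second, so it counts as a hit only for K of 2 or more.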