Authors: Xue-Yang Qin, Li-Shuang Li, Jing-Yao Tang, Fei Hao, Mei-Ling Ge, Guang-Yao Pang (秦雪洋, 李丽双, 唐婧尧, 郝飞, 盖枚岭, 庞光垚)
Affiliations: [1] School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China; [2] School of Computer Science, Shaanxi Normal University, Xi'an 710119, China; [3] School of Computer Engineering, Weifang University, Weifang 261061, China; [4] Guangxi Colleges and Universities Key Laboratory of Intelligent Industry Software, Wuzhou University, Wuzhou 543002, China
Source: Journal of Computer Science & Technology (计算机科学技术学报), 2024, Issue 4, pp. 811-826 (16 pages)
Funding: Supported by the National Natural Science Foundation of China under Grant No. 62076048.
Abstract: Image-text retrieval aims to capture the semantic correspondence between images and texts, which serves as a foundation and crucial component in multi-modal recommendations, search systems, and online shopping. Existing mainstream methods primarily focus on modeling the association of image-text pairs while neglecting the advantageous impact of multi-task learning on image-text retrieval. To this end, a multi-task visual semantic embedding network (MVSEN) is proposed for image-text retrieval. Specifically, we design two auxiliary tasks, text-text matching and multi-label classification, as semantic constraints to improve the generalization and robustness of visual semantic embedding from a training perspective. Besides, we present an intra- and inter-modality interaction scheme to learn discriminative visual and textual feature representations by facilitating information flow within and between modalities. Subsequently, we utilize multi-layer graph convolutional networks in a cascading manner to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MSCOCO, with rSum improvements of 8.2% and 3.0%, respectively.
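The abstract describes a multi-task training objective: a main image-text matching loss supplemented by text-text matching and multi-label classification losses. The PyTorch sketch below illustrates one plausible way to combine such losses. The loss forms (a hinge-based bidirectional triplet ranking loss and binary cross-entropy), the weights w_ttm and w_mlc, the second text view txt_emb_aug, and all function names are illustrative assumptions, not the paper's actual formulation.

# Minimal sketch of a multi-task objective in the spirit of MVSEN (assumptions noted above).
import torch
import torch.nn.functional as F


def hinge_triplet_loss(emb_a, emb_b, margin=0.2):
    """Bidirectional hinge-based triplet ranking loss over in-batch hardest
    negatives (a common choice in visual semantic embedding; assumed here)."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    scores = emb_a @ emb_b.t()                 # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)            # matched pairs on the diagonal
    cost_b = (margin + scores - pos).clamp(min=0)      # a -> hard negatives in b
    cost_a = (margin + scores - pos.t()).clamp(min=0)  # b -> hard negatives in a
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_b = cost_b.masked_fill(mask, 0)
    cost_a = cost_a.masked_fill(mask, 0)
    return cost_b.max(dim=1)[0].mean() + cost_a.max(dim=0)[0].mean()


def multi_task_loss(img_emb, txt_emb, txt_emb_aug, label_logits, labels,
                    w_ttm=1.0, w_mlc=1.0):
    """Main image-text matching loss plus auxiliary text-text matching
    (between two views of the captions) and multi-label classification;
    the weighting scheme is an assumption."""
    l_itm = hinge_triplet_loss(img_emb, txt_emb)
    l_ttm = hinge_triplet_loss(txt_emb, txt_emb_aug)
    l_mlc = F.binary_cross_entropy_with_logits(label_logits, labels)
    return l_itm + w_ttm * l_ttm + w_mlc * l_mlc


if __name__ == "__main__":
    B, D, C = 8, 256, 80  # batch size, embedding dim, number of labels
    loss = multi_task_loss(
        torch.randn(B, D), torch.randn(B, D), torch.randn(B, D),
        torch.randn(B, C), torch.randint(0, 2, (B, C)).float(),
    )
    print(loss.item())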
Keywords: image-text retrieval; cross-modal retrieval; multi-task learning; graph convolutional network
CLC Number: TP391.41 [Automation and Computer Technology - Computer Application Technology]