A Survey on Deep Learning Based Image-Text Matching (基于深度学习的图像-文本匹配研究综述)  Cited by: 8


Authors: LIU Meng (刘萌); QI Meng-Jin (齐孟津); ZHAN Zhen-Yu (詹圳宇); QU Lei-Gang (曲磊钢); NIE Xiu-Shan (聂秀山); NIE Li-Qiang (聂礼强)

Affiliations: [1] Department of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101 [2] Department of Computer Science and Technology, Shandong University (Qingdao), Qingdao, Shandong 266000 [3] Department of Computer Science and Technology (Shenzhen), Harbin Institute of Technology, Shenzhen, Guangdong 518055

Source: Chinese Journal of Computers (《计算机学报》), 2023, No. 11, pp. 2370-2399 (30 pages)

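Funding: National Natural Science Foundation of China (No. 62006142, No. U1936203); Shandong Province Outstanding Youth Fund Project (No. ZR2021JQ26); Shandong Provincial Foundation Major Basic Research Project (No. ZR2021ZD15); Youth Innovation Science and Technology Program of Higher Education Institutions of Shandong Province (No. 2021KJ036); Special Fund for Distinguished Professors of Shandong Jianzhu University.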

Abstract: The image-text matching task aims to measure the similarity between an image and a textual description, and it plays a crucial role in bridging vision and language. In recent years, considerable progress has been made on the global alignment between images and sentences and on the local alignment between regions and words. This paper classifies and describes the current state-of-the-art methods. Specifically, it divides existing methods into global feature-based image-text matching methods, local feature-based image-text matching methods, external knowledge-based image-text matching methods, metric learning-based image-text matching methods, and multimodal pre-training models. Global feature-based methods are further divided by pipeline type into two classes: embedding-based methods and interaction-based methods. Local feature-based methods are subdivided by interaction mode into three classes: methods based on intra-modal relation modeling, methods based on inter-modal relation modeling, and methods based on hybrid interaction modeling. The paper then summarizes the datasets used for image-text matching and analyzes the experimental results of existing methods. Finally, it discusses the challenges that future research may face.

Recent years have witnessed the rapid growth of multimedia data, such as texts and images, inducing many researchers to work on multimodal representation, understanding, and reasoning. As a fundamental task of multimodal interaction, image-text matching, which focuses on measuring the semantic similarity between an image and a text, has attracted extensive research attention. It facilitates various applications, such as cross-modal retrieval, visual question answering, and multimedia understanding, and plays a critical role in bridging vision and language. Recently, deep learning techniques have emerged as powerful methods for various tasks. This has motivated many researchers to resort to deep learning approaches to tackle the image-text matching task. In particular, great progress has been made by exploiting the global alignment between images and sentences, or the local alignments between image regions and textual words. Existing methods can be roughly divided into the following categories: global representation-based image-text matching methods, local representation-based image-text matching methods, external knowledge-based image-text matching methods, metric learning-based image-text matching methods, and multimodal pre-training models. To be specific, global representation-based image-text matching methods usually realize cross-modal matching by measuring the semantic similarity between the global image and text representations; local representation-based image-text matching methods focus on modeling fine-grained correlations between visual and textual entities; external knowledge-based image-text matching methods are devoted to acquiring prior knowledge from external sources, such as scene graphs, to improve the accuracy of image-text matching; metric learning-based image-text matching methods try to explore a better constraint or similarity measurement to improve the discriminability between unpaired samples and the relevance between paired samples; and the multimodal pre-training models include single-stream and two-stream frameworks.
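The two ideas the abstract names most concretely, scoring an image against a text through their global representations and shaping that score with a metric learning constraint, can be illustrated with a minimal sketch. The code below is not taken from the survey. It assumes cosine similarity as the similarity measure and a hinge-based triplet ranking loss summed over all negatives, which is one common choice for this kind of constraint. The function names, the margin of 0.2, and the toy two-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(image_embs, text_embs):
    """sim[i][j] is the similarity between image i and text j."""
    return [[cosine_similarity(img, txt) for txt in text_embs]
            for img in image_embs]


def triplet_ranking_loss(sim, margin=0.2):
    """Hinge loss summed over all negatives (margin is an assumed value).

    Image i and text i are assumed to be the matched pair, so sim must be
    square. Every unpaired combination that comes within `margin` of the
    paired score adds a penalty.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i compared with an unpaired text j.
            loss += max(0.0, margin - pos + sim[i][j])
            # Text i compared with an unpaired image j.
            loss += max(0.0, margin - pos + sim[j][i])
    return loss


def rank_texts(sim, image_index):
    """Text indices ordered from most to least similar to one image."""
    row = sim[image_index]
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)


# Toy global embeddings (hypothetical values, not real model outputs).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8]]

sim = similarity_matrix(image_embs, text_embs)
print(rank_texts(sim, 0))
print(triplet_ranking_loss(sim))
```

In this toy setup each image is closest to its own text, so the loss is zero; swapping the order of the texts breaks the pairing and makes the loss positive. A real system would obtain the embeddings from learned image and text encoders rather than from hand-written vectors.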

Keywords: image-text matching; cross-modal image retrieval; multimodal pre-training models; survey; deep learning; artificial intelligence

Classification code: TP391 [Automation and Computer Technology - Computer Application Technology]

 
