Survey on Vision-language Pre-training

Cited by: 9

Authors: YIN Jiong, ZHANG Zhe-Dong, GAO Yu-Han, YANG Zhi-Wen, LI Liang, XIAO Mang, SUN Yao-Qi, YAN Cheng-Gang (College of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China; Lishui Institute of Hangzhou Dianzi University, Lishui 323000, China; School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Sir Run Run Shaw Hospital, College of Medicine, Zhejiang University, Hangzhou 310016, China)


Source: Journal of Software (《软件学报》), 2023, No. 5, pp. 2000-2023 (24 pages)

Funding: National Key R&D Program of China (2020YFB1406604); National Natural Science Foundation of China (61931008, 62071415, U21B2024)

Abstract: In recent years, deep learning has achieved excellent performance in unimodal areas such as computer vision (CV) and natural language processing (NLP). With the development of technology, the importance and necessity of multimodal learning have begun to unfold. As an essential part of multimodal learning, vision-language learning has received extensive attention from researchers in and outside China. Thanks to the development of the Transformer framework, more and more pre-trained models have been applied to vision-language multimodal learning, and the performance of related tasks has improved qualitatively. This study systematically reviews the current work on vision-language pre-trained models. First, background knowledge on pre-trained models is introduced. Second, the structures of pre-trained models are analyzed and compared from two perspectives, the commonly used vision-language pre-training techniques are discussed, and five categories of downstream tasks are described in detail. Finally, the common datasets used in image and video pre-training tasks are presented, and the performance of commonly used pre-trained models on different datasets under different tasks is compared and analyzed.
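As a minimal illustration of one family of pre-training techniques covered by such surveys, the sketch below implements a symmetric InfoNCE-style image-text contrastive objective in plain Python. The toy embeddings, function names, and temperature value are assumptions for illustration, not taken from this survey; in practice the embeddings would come from image and text Transformer encoders.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image cross-entropy over
    cosine-similarity logits; matched pairs share the same batch index.
    (Illustrative sketch; hyperparameters are assumed, not from the paper.)"""
    n = len(img_embs)
    logits = [[cosine(img_embs[i], txt_embs[j]) / temperature
               for j in range(n)] for i in range(n)]

    def ce(row, target):
        # Numerically stable cross-entropy: log-sum-exp minus target logit.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    i2t = sum(ce(logits[i], i) for i in range(n)) / n           # image -> text
    t2i = sum(ce([logits[i][j] for i in range(n)], j)
              for j in range(n)) / n                            # text -> image
    return (i2t + t2i) / 2

# Toy batch: matched image/text pairs point in similar directions.
imgs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
txts = [[0.9, 0.2, 0.0], [0.1, 1.0, 0.0], [0.0, 0.1, 1.0]]
print(contrastive_loss(imgs, txts))
```

Training drives the loss down by pulling matched image-text embeddings together and pushing mismatched ones apart, which is why shuffling the text batch raises the loss.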

Keywords: multimodal learning; pre-trained models; Transformer; vision-language learning

CLC number: TP18 [Automation and Computer Technology - Control Theory and Control Engineering]

 
