Survey of multimodal pre-training models  Cited by: 9

Authors: WANG Huiru; LI Xiuhong[1]; LI Zhe; MA Chunming; REN Zeyu; YANG Dan

Affiliations: [1] School of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046, China; [2] Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong 999077, China

Source: Journal of Computer Applications (《计算机应用》), 2023, No. 4, pp. 991-1004 (14 pages)

Funding: Key Research and Development Project of the State Language Commission (ZDI135-96).

Abstract: By combining complex pre-training objectives with a large number of model parameters, a Pre-Training Model (PTM) can effectively acquire rich knowledge from unlabeled data. In the multimodal setting, however, PTM development is still at an early stage. According to the modalities involved, most current multimodal PTMs are divided into image-text PTMs and video-text PTMs; according to how the data are fused, multimodal PTMs are further divided into single-stream models and two-stream models. First, the common pre-training tasks and the downstream tasks used in validation experiments are summarized. Next, the common models in the field of multimodal pre-training are reviewed, with tables listing each model's downstream tasks and comparing model performance and experimental data. Then, the application scenarios of the M6 (Multi-Modality to Multi-Modality Multitask Mega-transformer) model, the Cross-modal Prompt Tuning (CPT) model, the VideoBERT (Video Bidirectional Encoder Representations from Transformers) model, and the AliceMind (Alibaba's collection of encoder-decoders from Mind) model in specific downstream tasks are introduced. Finally, the challenges facing multimodal PTM research and possible future research directions are summarized.

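Keywords: multimodal; pre-training model; image-text pre-training model; video-text pre-training model; neural network; single-stream model; two-stream model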

Classification No.: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
