Authors: WANG Huiru; LI Xiuhong; LI Zhe; MA Chunming; REN Zeyu; YANG Dan (School of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046, China; Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong 999077, China)
Affiliations: [1] School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China; [2] Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong 999077, China
Source: Journal of Computer Applications, 2023, No. 4, pp. 991-1004 (14 pages)
Fund: Key Research and Development Project of the State Language Commission (ZDI135-96).
Abstract: By using complex pre-training objectives and a large number of model parameters, a Pre-Trained Model (PTM) can effectively acquire rich knowledge from unlabeled data. In the multimodal setting, however, the development of PTMs is still in its infancy. According to the modalities involved, most current multimodal PTMs can be divided into image-text PTMs and video-text PTMs; according to the data fusion method, they can also be divided into two types: single-stream models and two-stream models. First, common pre-training tasks and the downstream tasks used in validation experiments are summarized. Second, the common models in the field of multimodal pre-training are reviewed, and the downstream tasks, performance, and experimental data of the models are listed in tables for comparison. Third, the application scenarios of the M6 (Multi-Modality to Multi-Modality Multitask Mega-transformer) model, the Cross-modal Prompt Tuning (CPT) model, the VideoBERT (Video Bidirectional Encoder Representations from Transformers) model, and the AliceMind (Alibaba's collection of encoder-decoders from Mind) model in specific downstream tasks are introduced. Finally, the challenges faced by work on multimodal PTMs and possible future research directions are summed up.
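The single-stream versus two-stream distinction in the abstract can be made concrete: a single-stream model concatenates the token embeddings of both modalities and runs one shared self-attention, while a two-stream model encodes each modality separately and then exchanges information through cross-attention. The following is a minimal NumPy sketch of that structural difference only; the toy attention function, embedding dimension, and sequence lengths are illustrative assumptions, not the design of any specific model surveyed here.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention with a row-wise softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(4, d))    # 4 text token embeddings (toy)
image = rng.normal(size=(6, d))   # 6 image region embeddings (toy)

# Single-stream: concatenate both modalities, one shared self-attention
# over the joint sequence, so every token sees every other token.
joint = np.concatenate([text, image], axis=0)        # shape (10, d)
single_out = attention(joint, joint, joint)          # shape (10, d)

# Two-stream: each modality first attends to itself, then queries the
# other modality through cross-attention.
text_self = attention(text, text, text)
image_self = attention(image, image, image)
text_cross = attention(text_self, image_self, image_self)   # shape (4, d)
image_cross = attention(image_self, text_self, text_self)   # shape (6, d)
```

The trade-off mirrored here is the one the survey classifies models by: a single stream fuses modalities early in one encoder, while two streams keep modality-specific encoders and fuse later.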
Keywords: multimodality; pre-trained model; image-text pre-trained model; video-text pre-trained model; neural network; single-stream model; two-stream model
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]