Authors: Yan ZHANG, Zhong JI, Yanwei PANG, Jungong HAN, Xuelong LI
Affiliations: [1] School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-Inspired Intelligence Technology, Tianjin University, Tianjin 300072, China; [2] Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China; [3] Department of Automation, Tsinghua University, Beijing 100084, China; [4] Institute of Artificial Intelligence (TeleAI), China Telecom Corporation Limited, Beijing 100033, China
Source: Science China (Information Sciences), 2024, No. 12, pp. 75-92 (18 pages)
Funding: Supported by the National Key Research and Development Program of China (Grant No. 2022ZD0160403) and the National Natural Science Foundation of China (Grant No. 62176178).
Abstract: Driven by the expansion of foundation models and the increasing variety of downstream tasks, parameter-efficient fine-tuning (PEFT) methods have exhibited remarkable efficacy in the unimodal domain, effectively mitigating the consumption of computational resources. Although recent research has shifted attention to the multimodal domain and achieved efficient parametric adaptation of large multimodal models (LMMs) for downstream tasks, existing methods still encounter two limitations: (1) low performance and (2) poor compatibility. This work proposes a modality-experts coordinated adaptation (ModeX) method for the multimodal domain, offering an effective, plug-and-play, and lightweight adaptation architecture for diverse LMMs. Specifically, ModeX adaptively coordinates different modality experts according to the type of network structure and input data. In addition, an effective coordinator equipped with a routing algorithm is developed to generate the corresponding weights, centering on the synergy among multimodal data. Extensive experiments on 15 multimodal downstream benchmarks and five LMMs demonstrate that ModeX seamlessly adapts to diverse LMMs, outperforms state-of-the-art PEFT methods, and even exhibits superior performance compared with full fine-tuning. Notably, on the NLVR2 task, ModeX achieves 84.06% accuracy with only 12.0M trainable parameters, outperforming full fine-tuning by 1.63%. Moreover, ModeX demonstrates superior stability and offers higher training efficiency, both in terms of trainable parameters and training duration. Our source code has been released at https://github.com/zhangy0822/ModeX.
Keywords: large multimodal model; multimodal learning; vision-language pretraining; parameter-efficient fine-tuning; adapter; modality expert
Classification: TN9 [Electronics and Telecommunications - Information and Communication Engineering]
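The abstract describes ModeX as a set of lightweight modality experts whose outputs are fused by a coordinator that produces routing weights. The sketch below is a minimal, hypothetical PyTorch illustration of that idea based only on the abstract; the class names, bottleneck size, and the mean-pooled routing input are assumptions and do not reflect the released implementation (see the GitHub link above for the authors' code).

```python
import torch
import torch.nn as nn

class BottleneckExpert(nn.Module):
    """A lightweight bottleneck adapter acting as one modality expert (assumed design)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))

class ModeXStyleAdapter(nn.Module):
    """Conceptual sketch: a coordinator routes hidden states to modality experts
    and fuses their outputs with softmax weights. Illustrative only."""
    def __init__(self, dim: int, num_experts: int = 2, bottleneck: int = 64):
        super().__init__()
        self.experts = nn.ModuleList(
            BottleneckExpert(dim, bottleneck) for _ in range(num_experts)
        )
        # Coordinator producing one routing weight per expert (assumption).
        self.coordinator = nn.Linear(dim, num_experts)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, dim) from a frozen LMM block
        weights = torch.softmax(self.coordinator(hidden.mean(dim=1)), dim=-1)  # (batch, experts)
        expert_out = torch.stack([e(hidden) for e in self.experts], dim=1)     # (batch, experts, seq, dim)
        fused = (weights.unsqueeze(-1).unsqueeze(-1) * expert_out).sum(dim=1)  # (batch, seq, dim)
        # Residual connection; only the adapter parameters would be trained.
        return hidden + fused
```

In this reading, the backbone LMM stays frozen and only the small expert and coordinator weights are updated, which is consistent with the abstract's 12.0M trainable-parameter figure, though the actual routing and expert placement may differ in the paper.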