Modality-experts coordinated adaptation for large multimodal models  


Authors: Yan ZHANG, Zhong JI, Yanwei PANG, Jungong HAN, Xuelong LI

Affiliations: [1] School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-Inspired Intelligence Technology, Tianjin University, Tianjin 300072, China; [2] Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China; [3] Department of Automation, Tsinghua University, Beijing 100084, China; [4] Institute of Artificial Intelligence (TeleAI), China Telecom Corporation Limited, Beijing 100033, China

Source: Science China (Information Sciences), 2024, Issue 12, pp. 75-92 (18 pages)

Funding: Supported by the National Key Research and Development Program of China (Grant No. 2022ZD0160403) and the National Natural Science Foundation of China (Grant No. 62176178).

Abstract: Driven by the expansion of foundation models and the increasing variety of downstream tasks, parameter-efficient fine-tuning (PEFT) methods have exhibited remarkable efficacy in the unimodal domain, effectively mitigating the consumption of computational resources. Although recent research has shifted attention to the multimodal domain and achieved efficient parametric adaptation of large multimodal models (LMMs) for downstream tasks, existing methods still encounter two limitations: (1) low performance and (2) poor compatibility. This work proposes a modality-experts coordinated adaptation (ModeX) method for the multimodal domain, offering an effective, plug-and-play, and lightweight adaptation architecture for diverse LMMs. Specifically, ModeX adaptively coordinates different modality experts according to the type of network structure and input data. In addition, an effective coordinator equipped with a routing algorithm is developed to generate the corresponding expert weights, centering on the synergy among multimodal data. Extensive experiments on 15 multimodal downstream benchmarks and five LMMs demonstrate that ModeX seamlessly adapts to diverse LMMs, outperforms state-of-the-art PEFT methods, and even surpasses full fine-tuning. Notably, on the NLVR2 task, ModeX achieves 84.06% accuracy with only 12.0M trainable parameters, outperforming full fine-tuning by 1.63%. Moreover, ModeX demonstrates superior stability and higher training efficiency, both in terms of trainable parameters and training duration. Our source code has been released at https://github.com/zhangy0822/ModeX.
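The abstract describes an adapter-style architecture in which lightweight modality experts are mixed by a coordinator that runs a routing algorithm to produce the mixing weights. The paper's exact design is not reproduced in this record, so the following is only a minimal illustrative sketch in PyTorch, assuming bottleneck-adapter experts and a per-token softmax router; class and parameter names such as ModeXAdapterSketch are hypothetical and are not the authors' implementation.

    # Illustrative sketch only: bottleneck-adapter experts mixed by a routing
    # coordinator. Assumptions (not from the paper): each expert is a standard
    # down/up-projection adapter, and the router scores experts per token.
    import torch
    import torch.nn as nn


    class BottleneckExpert(nn.Module):
        """Down-project -> nonlinearity -> up-project, as in standard adapters."""

        def __init__(self, dim: int, bottleneck: int = 64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)
            self.act = nn.GELU()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.up(self.act(self.down(x)))


    class ModeXAdapterSketch(nn.Module):
        """Mixes several modality experts with weights from a routing coordinator."""

        def __init__(self, dim: int, num_experts: int = 3, bottleneck: int = 64):
            super().__init__()
            self.experts = nn.ModuleList(
                BottleneckExpert(dim, bottleneck) for _ in range(num_experts)
            )
            # Coordinator: a lightweight router that scores each expert per token.
            self.router = nn.Linear(dim, num_experts)

        def forward(self, hidden: torch.Tensor) -> torch.Tensor:
            # hidden: (batch, seq_len, dim) hidden states from a frozen LMM layer
            weights = torch.softmax(self.router(hidden), dim=-1)                  # (B, L, E)
            expert_out = torch.stack([e(hidden) for e in self.experts], dim=-1)   # (B, L, D, E)
            mixed = (expert_out * weights.unsqueeze(-2)).sum(dim=-1)              # (B, L, D)
            return hidden + mixed  # residual connection keeps the backbone output


    if __name__ == "__main__":
        layer = ModeXAdapterSketch(dim=768)
        x = torch.randn(2, 16, 768)
        print(layer(x).shape)  # torch.Size([2, 16, 768])

In this sketch the mixed expert output is added residually to the frozen layer's hidden states, mirroring how adapter-style PEFT modules are typically inserted into a pretrained backbone while leaving its original weights untouched.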

Keywords: large multimodal model; multimodal learning; vision-language pretraining; parameter-efficient fine-tuning; adapter; modality expert

Classification: TN9 (Electronics and Telecommunications, Information and Communication Engineering)

 
