Tri-modal adapter based on selective state space


Authors: LIU Hongye (刘弘业); CHEN Xiai (陈锡爱); ZENG Tao (曾涛) (College of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou, Zhejiang 310018, China)

Affiliation: [1] College of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou 310018, China

Source: Journal of Computer Applications, 2025, No. 2, pp. 411-420 (10 pages)

Funding: National Natural Science Foundation of China (52005472).

Abstract: The pre-training-then-fine-tuning paradigm is widely used in unimodal and multimodal tasks. However, as model size grows exponentially, fine-tuning all the parameters of a pre-trained model becomes very difficult. To address this problem, a tri-modal adapter based on selective state space is designed. It freezes the pre-trained model, fine-tunes only a small number of additional parameters, and performs dense interaction among the three modalities. Specifically, a long-term semantic selection module based on selective state space and a short-term semantic interaction module centered on the visual or audio modality are proposed; the two modules are inserted in sequence between the sequential encoders to achieve dense interaction of tri-modal information. The long-term semantic selection module aims to suppress redundant information across the three modalities, while the short-term semantic interaction module models interactions among local modal features within short time spans. Compared with previous methods that require pre-training on large-scale tri-modal datasets, the proposed method is more flexible and can inherit any powerful unimodal or bimodal model. On the Music-AVQA tri-modal evaluation dataset, the proposed method achieves an average accuracy of 80.19%, an improvement of 4.09 percentage points over LAVISH.
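The abstract does not give the equations of the long-term semantic selection module, so the following is only a minimal sketch of a generic selective state space recurrence wrapped as a residual adapter branch. It is not the authors' implementation. The function names (`selective_scan`, `adapt`), the scalar parameters (`a`, `w_delta`, `w_b`, `w_c`), and the one-channel, one-state simplification are all assumptions made for illustration. The idea it shows is that the step size, input gate, and readout depend on the current input, so the state can retain or discard past information step by step, while the frozen features pass through unchanged on the residual path.

```python
import math


def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)), always positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """One-channel, one-state selective state space recurrence.

    All parameters are hypothetical scalars chosen for illustration;
    none of them is taken from the paper.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size (> 0)
        b = w_b * x                    # input-dependent input gate
        c = w_c * x                    # input-dependent readout
        a_bar = math.exp(delta * a)    # discretized decay, in (0, 1) when a < 0
        h = a_bar * h + delta * b * x  # state update
        ys.append(c * h)               # output at this time step
    return ys


def adapt(features, **params):
    # Residual adapter: frozen features plus the output of the small
    # trainable branch (here, the selective scan).
    branch = selective_scan(features, **params)
    return [f + y for f, y in zip(features, branch)]
```

In a real model, only the adapter parameters (here `a`, `w_delta`, `w_b`, `w_c`) would be trained, and `features` would be the output of a frozen pre-trained encoder layer rather than a list of floats.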

Keywords: pre-training then fine-tuning; selective state space; tri-modal; long-term semantics; short-term semantics

Classification No.: TP391 [Automation and Computer Technology: Computer Application Technology]

 
