Multi-modal Semantic Alignment Based on Extended Image-Text Contrastive Learning


Authors: AN Guocheng; JIANG Bo; WANG Xiaolong; DAI Jun (Service Operations Department of Shanghai Huaxun Network System Co., Ltd., Shanghai 201103, China; The 32nd Research Institute of China Electronics Technology Group Corporation, Shanghai 201808, China)

Affiliations: [1] Service Operations Department of Shanghai Huaxun Network System Co., Ltd., Shanghai 201103, China; [2] The 32nd Research Institute of China Electronics Technology Group Corporation, Shanghai 201808, China

Source: Computer Engineering, 2024, No. 11, pp. 152-162 (11 pages)

Funding: National Key Research and Development Program of China under the 14th Five-Year Plan (2023YFC3006700).

Abstract: Contrastive Language-Image Pre-training (CLIP) enables a dual-stream model to learn unified high-level semantic representations from large-scale image-text data. However, CLIP only enforces coarse-grained semantic alignment between the image and text modalities, and semantic representation within each single modality still needs improvement. To help the network learn better unified latent semantic representations, this study proposes a multi-modal semantic alignment method based on extended image-text contrastive learning. First, the pre-trained CLIP model is fine-tuned to optimize its semantic representation for a given dataset, and a bidirectional matching strategy is designed to construct an image-text sample-matching topology graph. Contrastive learning is then extended with the more strongly related image-text samples in the topology graph, performing coarse-grained semantic alignment between the image and text modalities while making fine-grained adjustments within each modality; learnable parameters are introduced to adjust the contrastive loss weight of each modality. Experiments on multiple datasets demonstrate that the method improves semantic representation within the same modality without degrading multi-modal semantic alignment, and achieves better or comparable performance on downstream tasks such as classification and retrieval.
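The abstract does not give implementation details, so the following PyTorch-style sketch only illustrates the general idea: a bidirectional matching step that builds the image-text sample-matching graph, and an InfoNCE-style loss extended with intra-modal terms whose weights are learnable. The names build_match_graph and ExtendedContrastiveLoss, the mutual top-k matching criterion, the softmax weighting of the three loss terms, and the reuse of the same graph for intra-modal positives are all assumptions made for illustration, not the paper's exact formulation.

```python
# Minimal sketch (not the authors' released code) of extended image-text
# contrastive learning on top of a fine-tuned CLIP-style dual encoder.
# Assumes a batch of paired image/text embeddings and batch size >= k.
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_match_graph(img_emb: torch.Tensor, txt_emb: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Hypothetical bidirectional matching: (i, j) becomes an extra positive only
    if text j is among image i's top-k cross-modal neighbours AND image i is
    among text j's top-k neighbours. Returns a boolean [N, N] matrix."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                              # [N, N] cross-modal similarity

    i2t = torch.zeros_like(sim, dtype=torch.bool)
    idx_i = sim.topk(k, dim=1).indices                       # each image's top-k texts
    i2t.scatter_(1, idx_i, torch.ones_like(idx_i, dtype=torch.bool))

    t2i = torch.zeros_like(sim, dtype=torch.bool)
    idx_t = sim.topk(k, dim=0).indices                       # each text's top-k images
    t2i.scatter_(0, idx_t, torch.ones_like(idx_t, dtype=torch.bool))

    graph = i2t & t2i                                        # keep mutual (bidirectional) matches only
    graph.fill_diagonal_(True)                               # the annotated pair is always a positive
    return graph


class ExtendedContrastiveLoss(nn.Module):
    """InfoNCE over image-text pairs, extended with intra-modal terms and
    learnable weights balancing the per-modality losses (an assumption)."""

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = temperature
        self.w = nn.Parameter(torch.zeros(3))                # softmax-normalized loss weights

    def _soft_nce(self, sim: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        # Cross-entropy against a target distribution spread over each row's positives.
        target = pos.float() / pos.float().sum(dim=1, keepdim=True)
        return -(target * F.log_softmax(sim / self.temperature, dim=1)).sum(dim=1).mean()

    def forward(self, img_emb, txt_emb, graph):
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        # Coarse-grained cross-modal alignment (image-to-text and text-to-image).
        loss_cross = 0.5 * (self._soft_nce(img_emb @ txt_emb.t(), graph)
                            + self._soft_nce(txt_emb @ img_emb.t(), graph.t()))
        # Fine-grained intra-modal terms; the cross-modal graph is reused as the
        # positive set, and self-similarity is kept as a trivial positive for simplicity.
        loss_img = self._soft_nce(img_emb @ img_emb.t(), graph)
        loss_txt = self._soft_nce(txt_emb @ txt_emb.t(), graph)
        w = torch.softmax(self.w, dim=0)
        return w[0] * loss_cross + w[1] * loss_img + w[2] * loss_txt
```

In a training loop, the graph would be rebuilt for every batch from the current encoder outputs, and the combined loss backpropagated through both encoders together with the learnable weight parameters.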

Keywords: multi-modal learning; semantic representation; contrastive learning; image-text matching; image classification

Classification: TP183 [Automation and Computer Technology - Control Theory and Control Engineering]

 
