Authors: AN Guocheng (安国成); JIANG Bo (江波); WANG Xiaolong (王晓龙); DAI Jun (戴军)
Affiliations: [1] Service Operations Department, Shanghai Huaxun Network System Co., Ltd., Shanghai 201103, China; [2] The 32nd Research Institute of China Electronics Technology Group Corporation, Shanghai 201808, China
Source: Computer Engineering, 2024, No. 11, pp. 152-162 (11 pages)
Funding: National Key Research and Development Program of China under the 14th Five-Year Plan (2023YFC3006700).
Abstract: Contrastive Language-Image Pre-training (CLIP) enables dual-stream models to learn unified high-level semantic representations from large-scale image-text data. However, CLIP only enforces coarse-grained semantic alignment between the image and text modalities, and semantic representation within the same modality still needs improvement. To help the network learn better unified latent semantic representations, this study proposes a multimodal semantic alignment method based on extended image-text contrastive learning. First, the pre-trained CLIP model is fine-tuned to optimize semantic representations for the target dataset, and a bidirectional matching strategy is designed to construct an image-text sample-matching topology graph. Contrastive learning is then extended using the image-text samples with higher relevance in the topology graph: coarse-grained semantic alignment is performed across the image and text modalities, fine-grained adjustment is performed within each modality, and learnable parameters are introduced to adjust the contrastive loss weight of each modality. Experiments on multiple datasets show that the method improves semantic representation within the same modality without degrading multimodal semantic alignment, and achieves better or comparable performance on classification, retrieval, and other downstream tasks.
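The abstract describes the extended contrastive objective only at a high level. Below is a minimal PyTorch sketch, not the authors' implementation, of one way such an objective could look: a standard symmetric image-text InfoNCE term for coarse-grained cross-modal alignment, plus intra-modal terms computed against additional positives selected via the sample-matching topology graph, with learnable parameters weighting the terms. The class name ExtendedCLIPLoss, the inputs img_pos/txt_pos, and the softmax weighting scheme are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ExtendedCLIPLoss(nn.Module):
    """Sketch of an extended CLIP-style loss: one cross-modal and two intra-modal
    terms with learnable weights. Illustrative only; not the paper's exact formulation."""

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        # Learnable raw weights for the three loss terms; a softmax keeps them
        # positive and normalized (one simple choice, assumed here).
        self.raw_weights = nn.Parameter(torch.zeros(3))
        self.temperature = temperature

    def info_nce(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Symmetric InfoNCE over L2-normalized embeddings; matched pairs share a batch index.
        logits = a @ b.t() / self.temperature
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def forward(self, img, txt, img_pos, txt_pos):
        # img/txt: embeddings of the paired images and texts in the batch.
        # img_pos/txt_pos: embeddings of extra positives for each sample, e.g.
        # images/texts judged highly relevant by the matching topology graph.
        img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
        img_pos, txt_pos = F.normalize(img_pos, dim=-1), F.normalize(txt_pos, dim=-1)

        cross = self.info_nce(img, txt)          # coarse-grained image-text alignment
        intra_img = self.info_nce(img, img_pos)  # fine-grained image-image adjustment
        intra_txt = self.info_nce(txt, txt_pos)  # fine-grained text-text adjustment

        w = torch.softmax(self.raw_weights, dim=0)
        return w[0] * cross + w[1] * intra_img + w[2] * intra_txt

Used during fine-tuning, the module takes four batches of embeddings of equal size and returns a scalar loss that can be backpropagated alongside the CLIP encoders.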
Keywords: multimodal learning; semantic representation; contrastive learning; image-text matching; image classification
CLC Number: TP183 [Automation and Computer Technology / Control Theory and Control Engineering]