Authors: Liu Jie (刘杰) [1,2]; Qiao Wensheng (乔文昇); Zhu Peipei (朱佩佩) [1]; Lei Yinjie (雷印杰); Wang Zixuan (王紫轩) [3]
Affiliations: [1] Southwest China Institute of Electronic Technology, Chengdu 610036, China; [2] School of Resources and Environment, University of Electronic Science and Technology of China, Chengdu 611731, China; [3] School of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China
Source: Application Research of Computers (《计算机应用研究》), 2025, No. 4, pp. 1248-1254 (7 pages)
Fund: National Natural Science Foundation of China (Grant No. 62303433)
Abstract: In recent years, large vision-language models represented by CLIP have demonstrated excellent zero-shot inference capabilities in numerous downstream scenarios. However, transferring CLIP to referring image segmentation, which requires pixel-level image-text understanding, is difficult: CLIP focuses on the overall alignment between an image and its text while discarding the spatial position information of individual pixels. To address this, this paper proposes PixelCLIP, a single-stage, fine-grained, multi-level zero-shot referring image segmentation model built on CLIP. Specifically, it adopts multi-scale image feature fusion, which both aggregates the pixel-level features extracted by CLIP's different visual encoders and retains CLIP's inherent image-level semantic features. For textual representation, it relies not only on CLIP-BERT to preserve object category information but also introduces the LLaVA large language model to inject further contextual background knowledge. Finally, PixelCLIP performs fine-grained cross-modal matching to achieve pixel-level referring image segmentation. Extensive experiments verify the effectiveness of the method.
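The abstract describes two mechanisms: fusing per-pixel features from multiple visual-encoder scales, and scoring each fused pixel embedding against a text embedding to produce a segmentation mask. The sketch below is a minimal, hypothetical illustration of that general idea, not the authors' code: the function names, shapes, average fusion, and sigmoid scoring are all assumptions, and real CLIP pixel embeddings and LLaVA-enriched text features are assumed to be given.

```python
# Minimal sketch (not PixelCLIP's actual implementation) of multi-scale
# fusion followed by pixel-text matching, as outlined in the abstract.
import torch
import torch.nn.functional as F


def fuse_scales(feats, out_size):
    """Upsample feature maps from different encoder stages to a common
    resolution and average them (a simple stand-in for the paper's
    multi-scale fusion; assumes all maps share the channel dimension)."""
    up = [F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
          for f in feats]
    return torch.stack(up).mean(dim=0)


def pixel_text_matching(pixel_feats, text_feat, temperature=0.07):
    """Score every pixel against a referring expression.

    pixel_feats: (B, C, H, W) per-pixel embeddings (e.g. fused CLIP features).
    text_feat:   (B, C) sentence embedding (e.g. CLIP text encoder output).
    Returns:     (B, H, W) soft segmentation mask in [0, 1].
    """
    # L2-normalize both modalities so the dot product is cosine similarity.
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feat = F.normalize(text_feat, dim=-1)
    # Cosine similarity between each pixel embedding and the text embedding.
    sim = torch.einsum("bchw,bc->bhw", pixel_feats, text_feat)
    # Temperature-scaled sigmoid turns similarities into a soft mask.
    return torch.sigmoid(sim / temperature)


# Toy usage with random tensors standing in for real encoder outputs.
scales = [torch.randn(1, 512, s, s) for s in (8, 16, 32)]
pixels = fuse_scales(scales, (32, 32))
mask = pixel_text_matching(pixels, torch.randn(1, 512))
print(mask.shape)  # torch.Size([1, 32, 32])
```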
Classification: TP391 [Automation and Computer Technology / Computer Application Technology]