Authors: LI Bingnan; DING Meng (Perception and Intelligence Joint Laboratory, Beijing Information Science & Technology University, Beijing 100101, China)
Source: Journal of Beijing Information Science and Technology University (Science and Technology Edition), 2025, No. 1, pp. 11-19.
Funding: National Natural Science Foundation of China (11771349).
Abstract: Image caption generation models generally rely on high-quality image-text pairs and exhibit poor generalization. Early work exploited the cross-modal associations of the contrastive language-image pre-training (CLIP) model to generate captions from unsupervised text data, reducing the reliance on paired data. However, these methods fail to effectively narrow the gap between CLIP's text and image embeddings and do not fully exploit the local features of images and text. To address these challenges, a text-only training framework for image caption generation, FusionCap, is proposed. It combines noise-network and projection-network strategies to effectively narrow the gap between the text and image modalities, and introduces a local feature extraction module to strengthen the model's ability to capture fine-grained features. Experimental results show that FusionCap significantly outperforms existing text-only training methods in caption accuracy and detail description. In zero-shot generation scenarios in particular, the generated captions excel in detail capture and semantic consistency, validating the model's strong generalization ability and generation quality.
Keywords: image caption generation; multimodality; pre-trained model; unsupervised learning; deep learning
Classification code: TP391.41 (Automation and Computer Technology, Computer Application Technology)
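The abstract does not detail FusionCap's noise and projection networks, but the core idea it describes, perturbing CLIP text embeddings during text-only training so the caption decoder tolerates the shift to image embeddings at inference, can be sketched as follows. The function name, the noise scale, and the use of NumPy in place of CLIP tensors are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def noisy_text_embedding(text_emb, noise_std=0.016, rng=None):
    """Perturb a (normalized) text embedding with Gaussian noise.

    During text-only training, the decoder is conditioned on this
    noisy embedding instead of the clean one, so that at inference
    time the (slightly different) CLIP image embedding still falls
    inside the distribution the decoder was trained on.
    """
    rng = rng or np.random.default_rng(0)
    # Add isotropic Gaussian noise in the shared CLIP embedding space.
    noise = rng.normal(0.0, noise_std, size=text_emb.shape)
    noisy = text_emb + noise
    # Re-normalize: CLIP embeddings live on the unit hypersphere.
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)
```

A projection network, as named in the abstract, would typically be a small learned mapping applied on top of such embeddings to further align the two modalities; the sketch above covers only the noise-injection half.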