Authors: LI Bingnan; DING Meng (Perception and Intelligence Joint Laboratory, Beijing Information Science & Technology University, Beijing 100101, China)
Source: Journal of Beijing Information Science and Technology University (Science and Technology Edition), 2025, No. 1, pp. 11-19.
Funding: National Natural Science Foundation of China (11771349).
Abstract: Image caption generation models generally rely on high-quality image-text pairs and exhibit poor generalization. Early work exploited the cross-modal associations of the contrastive language-image pre-training (CLIP) model to generate captions from unsupervised text data, reducing the reliance on paired data. However, these methods fail to effectively narrow the gap between CLIP's text and image embeddings and do not fully exploit the local features of images and text. To address these challenges, a text-only training framework for image caption generation, FusionCap, is proposed. It combines noise-network and projection-network strategies to effectively narrow the gap between the text and image modalities, and introduces a local feature extraction module to strengthen the model's ability to capture fine-grained features. Experimental results show that FusionCap significantly outperforms existing text-only training methods in caption accuracy and detail description. In zero-shot generation scenarios in particular, the generated captions excel in detail capture and semantic consistency, validating the model's strong generalization ability and generation quality.
Keywords: image caption generation; multimodality; pre-trained model; unsupervised learning; deep learning
Classification code: TP391.41 (Automation and Computer Technology, Computer Application Technology)
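The abstract does not detail FusionCap's noise and projection networks, but the core idea it describes, perturbing CLIP text embeddings during text-only training so the caption decoder tolerates the shift to image embeddings at inference, can be sketched as follows. The function name, the noise scale, and the use of NumPy in place of CLIP tensors are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def noisy_text_embedding(text_emb, noise_std=0.016, rng=None):
    """Perturb a (normalized) text embedding with Gaussian noise.

    During text-only training, the decoder is conditioned on this
    noisy embedding instead of the clean one, so that at inference
    time the (slightly different) CLIP image embedding still falls
    inside the distribution the decoder was trained on.
    """
    rng = rng or np.random.default_rng(0)
    # Add isotropic Gaussian noise in the shared CLIP embedding space.
    noise = rng.normal(0.0, noise_std, size=text_emb.shape)
    noisy = text_emb + noise
    # Re-normalize: CLIP embeddings live on the unit hypersphere.
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)
```

A projection network, as named in the abstract, would typically be a small learned mapping applied on top of such embeddings to further align the two modalities; the sketch above covers only the noise-injection half.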