The CLIP-GPT Image Captioning Model Integrated with Global Semantics

Authors: TAO Rui; REN Honge [1,3]; CAO Haiyan

Affiliations: [1] College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China; [2] College of Computer Science, Hulunbuir University, Hulunbuir 021008, Inner Mongolia, China; [3] Heilongjiang Forestry Intelligent Equipment Engineering Research Center, Harbin 150040, China

Source: Journal of Harbin University of Science and Technology, 2024, No. 2, pp. 16-24 (9 pages)

Funding: Natural Science Foundation of Heilongjiang Province (LH2020F040); Fundamental Research Funds for the Central Universities (2572017PZ10).

Abstract: Image captioning automatically generates a natural-language description that matches the content of an image. When an image captioning model is built by bridging pre-trained models from computer vision and natural language processing, cross-modal semantic consistency is the core problem of embedding both modalities in a shared subspace. This paper divides each image into patches that serve as visual semantic units and associates them freely (open-vocabulary) with language features, which removes the restriction of a limited set of visual feature classes. It jointly applies two loss functions, masked language modeling and image-text matching, and selects hard negative samples to train a cross-modal hop network that extracts consistent global semantics, improving the accuracy with which highly similar image and text feature points are matched within a neighborhood of the subspace. Experiments on the MS COCO and Flickr30k datasets show improved performance over other models that also use CLIP+GPT to generate image captions and over other mainstream models, supporting the effectiveness of the proposed model.
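The abstract names hard negative selection and a joint masked-language-modeling plus image-text-matching objective but gives no formulas. The following is a minimal sketch of how such a scheme is commonly set up, not the authors' implementation. The function names (`hardest_negatives`, `itm_loss`, `combined_loss`), the in-batch selection rule (the most similar non-matching caption), and the weight `itm_weight` are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def hardest_negatives(img_emb, txt_emb):
    """For image i, return the index of the non-matching caption (j != i)
    with the highest cosine similarity. Assumes row i of both arrays
    forms the true image-caption pair (hypothetical batch layout)."""
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T  # shape (B, B)
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)  # exclude the true pair
    return masked.argmax(axis=1), sim

def itm_loss(pos_logits, neg_logits):
    # Binary cross-entropy: true pairs are labelled 1, hard negatives 0.
    pos = np.logaddexp(0.0, -pos_logits)  # -log(sigmoid(x))
    neg = np.logaddexp(0.0, neg_logits)   # -log(1 - sigmoid(x))
    return float(np.mean(np.concatenate([pos, neg])))

def combined_loss(mlm_loss, itm_loss_value, itm_weight=1.0):
    # Assumed form of the joint objective: a weighted sum of the two losses.
    return mlm_loss + itm_weight * itm_loss_value

if __name__ == "__main__":
    img = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    txt = np.array([[1.0, 0.1], [0.2, 1.0], [1.0, 0.9]])
    idx, _ = hardest_negatives(img, txt)
    print(idx)  # index of the hardest non-matching caption per image
```

Selecting the most similar non-matching caption forces the matching head to separate pairs that lie close together in the shared subspace, which is the behavior the abstract attributes to hard negatives.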

Keywords: cross-modal; image captioning; pre-trained model; shared subspace; semantic alignment

Classification code: TP751.1 [Automation and Computer Technology: Detection Technology and Automatic Devices]

