Image Captioning Method Based on Transformer Visual Features Fusion

Authors: BAI Xuebing[1,2], CHE Jin[2,3], WU Jinman, CHEN Yumin

Affiliations: [1] School of Advanced Interdisciplinary, Ningxia University, Zhongwei 755000, Ningxia, China; [2] Ningxia Key Laboratory of Intelligent Sensing for Desert Information, Ningxia University, Yinchuan 750021, Ningxia, China; [3] School of Electronic and Electrical Engineering, Ningxia University, Yinchuan 750021, Ningxia, China

Source: Computer Engineering (《计算机工程》), 2024, Issue 8, pp. 229-238 (10 pages)

Funding: National Natural Science Foundation of China (62366042); Natural Science Foundation of Ningxia (2023AAC03127).

Abstract: Existing image captioning methods generate captions from regional visual features alone and overlook the importance of grid visual features. They are also all two-stage methods, which degrades caption quality. To address these problems, this study proposes an end-to-end image captioning method based on Transformer visual feature fusion. First, in the feature extraction stage, a visual feature extractor extracts both regional and grid visual features. Second, in the feature fusion stage, a visual feature fusion module concatenates the regional and grid visual features. Finally, all of the visual features are fed into a language generator to produce the image caption. Every component of the method is built on the Transformer model, which makes it a one-stage method. Experimental results on the MS-COCO dataset show that the proposed method makes full use of the respective advantages of regional and grid visual features: the BLEU-1, BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE scores reach 83.1%, 41.5%, 30.2%, 60.1%, 140.3%, and 23.9%, respectively. The method outperforms current mainstream image captioning methods and generates more accurate and richer captions.
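The fusion step described in the abstract (concatenating regional and grid visual features before they reach the language generator) can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that the two feature types are concatenated, so the token counts, the feature dimension, the function names `concat_features` and `self_attention`, and the single-head attention with identity projections used here as a stand-in for a Transformer layer are all assumptions made for illustration.

```python
import math
import random


def concat_features(region_feats, grid_feats):
    """Concatenate two token sequences along the token axis (hypothetical helper)."""
    dim = len(region_feats[0])
    if any(len(tok) != dim for tok in region_feats + grid_feats):
        raise ValueError("all tokens must share the same feature dimension")
    return region_feats + grid_feats


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), which keeps the sketch dependency-free.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for q in tokens:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        )
        out.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)]
        )
    return out


random.seed(0)
DIM = 8  # assumed feature dimension, chosen only for the example
# 3 hypothetical region tokens and 4 hypothetical grid tokens (a 2x2 grid)
region_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
grid_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]

fused = concat_features(region_feats, grid_feats)
attended = self_attention(fused)
print(len(fused), len(attended), len(attended[0]))  # 7 7 8
```

After concatenation, every region token can attend to every grid token and vice versa, which is one plausible way a downstream Transformer layer could combine the two kinds of features.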

Keywords: image captioning; regional visual features; grid visual features; Transformer model; end-to-end training

Classification code: TP391.41 [Automation and Computer Technology: Computer Application Technology]
