Authors: GU Mengyao (顾梦瑶); LIN Suzhen (蔺素珍)[1]; JIN Zanxia (晋赞霞); LI Fengyuan (李烽源) (College of Computer Science and Technology, North University of China, Taiyuan 030051, China)
Affiliation: [1] College of Computer Science and Technology, North University of China, Taiyuan 030051, Shanxi, China
Source: Modern Electronics Technique (《现代电子技术》), 2025, No. 7, pp. 65-71 (7 pages)
Funding: Shanxi Provincial Natural Science Foundation (202303021211147); Patent Transformation Special Program of the Shanxi Intellectual Property Administration (202302001); National Natural Science Foundation of China (62406296); Shanxi Merit-Based Funding Program for Scientific and Technological Activities of Returned Overseas Scholars (20230017).
Abstract: Synchronous infrared and visible-light imaging has become common practice for detecting complex scenes, as it yields more accurate and comprehensive on-site information. Existing image captioning research, however, still focuses on visible-light images alone and cannot describe the detected scene comprehensively and accurately. To this end, a visible-infrared dual-band image captioning method based on feature alignment fusion is proposed. First, Faster R-CNN is used to extract region features from the visible image and grid features from the infrared image. Second, with the Transformer as the basic architecture, position information is introduced into the visible-infrared image alignment fusion (VIIAF) encoder as a bridge for aligning and fusing the visible and infrared features. The fused visual information is then fed into a Transformer decoder to obtain the hidden states of a coarse-grained caption. Finally, the visual information output by the encoder, the hidden states from the decoder, and the linguistic information output by a trained BERT are fed into the designed adaptive module, so that both visual and linguistic information take part in word prediction, refining the caption from coarse-grained to fine-grained text. Multiple experiments on a visible-infrared image captioning dataset show that the proposed method accurately captures the complementary information between visible and infrared images; compared with the best Transformer-based model, it improves BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE, and CIDEr by 1.9%, 2.1%, 2.0%, 1.8%, 1.3%, 1.4%, and 4.4%, respectively. These results confirm the effectiveness of the proposed method.
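To make the pipeline in the abstract concrete, below is a minimal PyTorch sketch of its two key ideas: a cross-attention encoder that uses learned position embeddings as the bridge for aligning infrared grid features with visible-light region features, and an adaptive gate that mixes visual context, decoder hidden states, and BERT-style language features before word prediction. All module names, dimensions, and the gating formula here are illustrative assumptions, not the paper's released implementation.

```python
# Hedged sketch of the abstract's two components; names and sizes are
# illustrative assumptions, not the authors' exact design.
import torch
import torch.nn as nn


class AlignmentFusionEncoder(nn.Module):
    """Aligns infrared grid features to visible region features via
    cross-attention, then fuses the two streams (VIIAF-style sketch)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 n_regions: int = 36, n_grids: int = 49):
        super().__init__()
        # Learned position embeddings bridge the two coordinate systems.
        self.vis_pos = nn.Parameter(torch.zeros(1, n_regions, d_model))
        self.ir_pos = nn.Parameter(torch.zeros(1, n_grids, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, vis_feats, ir_feats):
        q = vis_feats + self.vis_pos        # visible regions as queries
        kv = ir_feats + self.ir_pos         # infrared grids as keys/values
        aligned_ir, _ = self.cross_attn(q, kv, kv)
        fused = self.fuse(torch.cat([vis_feats, aligned_ir], dim=-1))
        return self.norm(fused)             # fused visual memory


class AdaptiveGate(nn.Module):
    """Gated mix of attended visual context, decoder hidden state, and
    language-model features at each decoding step."""

    def __init__(self, d_model: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.gate = nn.Linear(3 * d_model, 3)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, visual_ctx, hidden, lang_feats):
        # Softmax weights decide how much each information source
        # contributes to predicting the next word.
        weights = torch.softmax(
            self.gate(torch.cat([visual_ctx, hidden, lang_feats], dim=-1)),
            dim=-1)
        mixed = (weights[..., 0:1] * visual_ctx
                 + weights[..., 1:2] * hidden
                 + weights[..., 2:3] * lang_feats)
        return self.proj(mixed)             # per-step vocabulary logits
```

In the full model, vis_feats and ir_feats would come from Faster R-CNN, lang_feats from a pretrained BERT, and hidden from a standard Transformer decoder run over the fused memory; the feature counts (36 regions, 49 grids) are placeholders.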
Keywords: image captioning; dual-band; feature alignment fusion; attention mechanism; Transformer; language model; BERT; adaptive
Classification: TN911.73-34 [Electronics and Telecommunications: Communication and Information Systems]; TP391 [Electronics and Telecommunications: Information and Communication Engineering]