Image Description Generation Method by Panoptic Segmentation and Multi-Visual-Feature Fusion


Authors: LIU Mingming; LU Jinfu; LIU Hao; ZHANG Haiyan[1]

Affiliations: [1] School of Intelligent Manufacturing, Jiangsu Vocational Institute of Architectural Technology, Xuzhou 221116, Jiangsu, China; [2] School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China

Source: Computer Engineering (《计算机工程》), 2024, No. 11, pp. 308-317 (10 pages)

Funding: National Natural Science Foundation of China (61801198); Natural Science Foundation of Jiangsu Province (BK20180174).

Abstract: Transformer-based image captioning models achieve good generalization performance owing to their strong sequence modeling capability. However, most of them encode and decode with region visual features, which cannot fully exploit the fine-grained information of the whole image and suffer from visual feature confusion. To address this, we introduce panoptic segmentation into Transformer-based image captioning, replace region visual features with panoptic-segmentation-based mask visual features, and propose an image captioning method that combines panoptic segmentation with multi-visual-feature fusion. The method effectively disentangles the visual representations and combines the strengths of mask and grid visual features, improving both the interpretability and the quality of the generated captions. Quantitative and qualitative experiments on the MSCOCO dataset show that the method significantly improves the performance of existing models and makes the caption generation process more interpretable, reaching CIDEr and BLEU-4 scores of 138.5 and 41, respectively.
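The abstract does not specify how mask visual features are computed or how they are combined with grid features. As a rough illustration only, the sketch below assumes one common approach: average the grid feature vectors under each panoptic segment mask (masked average pooling) to obtain one token per segment, then concatenate the segment tokens with the flattened grid tokens into a single sequence that a Transformer encoder could consume. The names `mask_pool` and `build_visual_tokens`, the binary-mask input format, and the concatenation step are all assumptions made for this example, not the authors' implementation.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]   # H x W x C grid feature map (assumed layout)
Mask = List[List[int]]      # H x W binary mask for one panoptic segment


def mask_pool(grid: Grid, mask: Mask) -> Vector:
    """Average the grid feature vectors at the cells covered by the mask."""
    channels = len(grid[0][0])
    total = [0.0] * channels
    count = 0
    for row_feats, row_mask in zip(grid, mask):
        for feat, inside in zip(row_feats, row_mask):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        return total  # empty segment: return an all-zero vector
    return [value / count for value in total]


def build_visual_tokens(grid: Grid, masks: List[Mask]) -> List[Vector]:
    """Hypothetical fusion: segment tokens followed by flattened grid tokens."""
    mask_tokens = [mask_pool(grid, m) for m in masks]
    grid_tokens = [feat for row in grid for feat in row]
    return mask_tokens + grid_tokens


if __name__ == "__main__":
    # Toy 2x2 grid with 2 channels and two non-overlapping segments.
    grid = [[[1.0, 0.0], [3.0, 0.0]],
            [[0.0, 2.0], [0.0, 4.0]]]
    masks = [[[1, 1], [0, 0]],   # segment covering the top row
             [[0, 0], [1, 1]]]   # segment covering the bottom row
    tokens = build_visual_tokens(grid, masks)
    print(tokens[:2])  # the two segment tokens
```

Because panoptic segments do not overlap, each grid cell contributes to at most one segment token in this sketch, which is one way to read the abstract's claim that mask features avoid the confusion caused by overlapping region features.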

Keywords: image understanding; image description generation; panoptic segmentation; feature fusion; visual encoding

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
