Authors: HENG Hongjun [1]; FAN Yuchen; WANG Jialiang [1] (School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China)
Affiliation: [1] School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
Source: 《计算机工程》 (Computer Engineering), 2023, Issue 2, pp. 199-205 (7 pages)
Fund: National Natural Science Foundation of China (U1333109).
Abstract: Object features extracted by object detection algorithms play an important role in image caption generation. However, using only object detection features as the input to an image captioning task loses image information other than the key objects, and the generated captions fail to accurately express the relationships among the objects in the image. To address these shortcomings, an object Transformer encoder that encodes object features within the image and a shifted-window Transformer encoder that encodes relational features within the image are proposed, jointly encoding different aspects of the image from different perspectives. The object features produced by the object Transformer encoder are fused with the relational features produced by the shifted-window Transformer encoder by concatenation, combining local object features with the image's internal relational features; a Transformer decoder then decodes the fused features to generate the corresponding image caption. Experiments on the Microsoft Common Objects in Context (MS-COCO) dataset show that the proposed model clearly outperforms the baseline model, reaching 38.6% BiLingual Evaluation Understudy 4-gram (BLEU-4), 28.7% Metric for Evaluation of Translation with Explicit ORdering (METEOR), 58.2% Recall-Oriented Understudy for Gisting Evaluation-Longest common subsequence (ROUGE-L), and 127.4% Consensus-based Image Description Evaluation (CIDEr), better than traditional image captioning models, and it generates more detailed and accurate captions. (A minimal code sketch of the described two-encoder fusion is given after this record.)
Keywords: image captioning; shifted window; multi-head attention mechanism; multimodal task; Transformer encoder
CLC Number: TP391.41 [Automation and Computer Technology / Computer Application Technology]
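To make the fusion scheme in the abstract concrete, below is a minimal PyTorch sketch of the two-branch encoding and concatenation-based fusion. The class name (CaptionModel), layer counts, and dimensions are illustrative assumptions rather than the authors' implementation, and the shifted-window branch is approximated here by a plain Transformer encoder over grid features.

```python
# Minimal sketch (assumptions): two Transformer encoders whose outputs are
# concatenated into a single memory for a caption-generating Transformer decoder.
import torch
import torch.nn as nn


class CaptionModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # Branch 1: encodes detector (object region) features.
        self.object_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Branch 2: encodes grid/relational features (placeholder for the
        # shifted-window encoder described in the abstract).
        self.window_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, object_feats, grid_feats, caption_tokens):
        # object_feats:   (B, N_obj, d_model)  region features from a detector
        # grid_feats:     (B, N_grid, d_model) image grid features
        # caption_tokens: (B, T) token ids of the shifted target caption
        obj_mem = self.object_encoder(object_feats)
        rel_mem = self.window_encoder(grid_feats)
        # Fuse the two encodings by concatenation along the token dimension.
        memory = torch.cat([obj_mem, rel_mem], dim=1)
        tgt = self.embed(caption_tokens)
        t = tgt.size(1)
        # Causal mask so each position attends only to earlier caption tokens.
        causal_mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.out(hidden)  # (B, T, vocab_size) caption logits


# Smoke test with random tensors standing in for detector and grid features.
model = CaptionModel(vocab_size=10000)
logits = model(torch.randn(2, 36, 512), torch.randn(2, 49, 512),
               torch.randint(0, 10000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 10000])
```

Concatenating the two encoder outputs along the token dimension lets the decoder's cross-attention attend jointly to object and relational features, which mirrors the splicing-based fusion the abstract describes.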