Scene graph-aware cross-modal image captioning model (基于场景图感知的跨模态图像描述模型)

Authors: ZHU Zhiping (朱志平); YANG Yan (杨燕) [1]; WANG Jie (王杰) [1] (College of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China)

Affiliation: [1] College of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756

Source: Journal of Computer Applications (《计算机应用》), 2024, No. 1, pp. 58-64 (7 pages)

Funding: National Natural Science Foundation of China (61976247).

Abstract: To address the forgetting and underutilization of an image's textual information in image captioning methods, a Scene Graph-aware Cross-modal Network (SGC-Net) was proposed. Firstly, the scene graph was used as the image's visual features, and a Graph Convolutional Network (GCN) was used for feature fusion, so that the visual and textual features lie in the same feature space. Secondly, the text sequence generated by the model was stored, and the corresponding position information was added to form the textual features of the image, so as to solve the problem of text feature loss caused by the single-layer Long Short-Term Memory (LSTM) network. Finally, to address over-dependence on image information and underuse of text information, a self-attention mechanism was used to extract the important image information and text information and fuse them. Experimental results on the Flickr30K and MS-COCO (MicroSoft Common Objects in COntext) datasets show that, compared with Sub-GC, SGC-Net improves BLEU1 (BiLingual Evaluation Understudy with 1-gram), BLEU4 (BiLingual Evaluation Understudy with 4-grams), METEOR (Metric for Evaluation of Translation with Explicit ORdering), ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and SPICE (Semantic Propositional Image Caption Evaluation) by 1.1, 0.9, 0.3, 0.7 and 0.4 on Flickr30K and by 0.3, 0.1, 0.3, 0.5 and 0.6 on MS-COCO, respectively. The method used by SGC-Net therefore effectively improves both the model's image captioning performance and the fluency of the generated descriptions.
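The three steps named in the abstract (graph convolution over scene-graph nodes, position information added to the stored text sequence, and self-attention over both modalities) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no layer sizes, encoding scheme, or fusion formula, so the sinusoidal position codes, the projection-free attention, the mean pooling, and every name below (`gcn_layer`, `positional_encoding`, `self_attention`, the toy 3-node graph) are assumptions chosen only to make the idea concrete. The LSTM decoder is not modeled.

```python
import numpy as np

def gcn_layer(node_feats, adj, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    deg = a_hat.sum(axis=1)                      # node degrees (always > 0)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))     # symmetric normalization
    return np.maximum(d_inv_sqrt @ a_hat @ d_inv_sqrt @ node_feats @ weight, 0.0)

def positional_encoding(length, dim):
    """Sinusoidal position codes (assumed; the abstract does not name a scheme)."""
    pos = np.arange(length)[:, None]
    idx = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (idx // 2)) / dim)
    return np.where(idx % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x):
    """Scaled dot-product self-attention without learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
dim = 8

# Hypothetical scene graph with 3 nodes (object, relation, object).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
nodes = rng.normal(size=(3, dim))
visual = gcn_layer(nodes, adj, rng.normal(size=(dim, dim)))

# Hypothetical embeddings of the 4 words generated so far, plus positions.
words = rng.normal(size=(4, dim))
textual = words + positional_encoding(4, dim)

# Attend jointly over visual and textual features, then pool (assumed fusion).
joint = np.concatenate([visual, textual], axis=0)
fused, weights = self_attention(joint)
context = fused.mean(axis=0)
print(visual.shape, textual.shape, fused.shape, context.shape)
```

Keeping every generated word in `textual`, rather than relying only on a recurrent hidden state, is what lets the attention step look back at earlier words directly; this mirrors the motivation stated in the abstract, not a confirmed design detail.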

Keywords: image captioning; scene graph; attention mechanism; Long Short-Term Memory (LSTM) network; feature fusion

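Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]; TP18 [Automation and Computer Technology: Computer Science and Technology]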
