Image Caption Generation Model Based on Visual and Language Perception Enhancement

VALRT: Vision and language reinforcement Transformer for image caption

Authors: PENG Yu-qing (彭玉青); CHEN Jiao (陈姣); GAO Xuan (高萱); REN Zi-yu (任梓瑜) (College of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China; Key Laboratory of Big Data Computing, Hebei University of Technology, Tianjin 300401, China)

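Affiliations: [1] School of Artificial Intelligence and Data Science, Hebei University of Technology, Tianjin 300401; [2] Key Laboratory of Big Data Computing, Hebei University of Technology, Tianjin 300401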

Source: Computer Engineering and Design (《计算机工程与设计》), 2025, No. 1, pp. 223-229 (7 pages)

Funding: Natural Science Foundation of Hebei Province (F2021202038).

Abstract: Transformers for image captioning underuse the visual information held in the lower encoder layers, and the information from already generated words is progressively diluted in the decoder. To address both problems, a Transformer architecture with enhanced visual and language information, the VALRT model, is proposed for image captioning. A visual perception enhancement module (VR) is built on the base Transformer model; it fuses low-level and high-level visual encoding features in a stepwise manner to strengthen the visual feature representation. A language perception enhancement module (LR) increases the contribution of previously generated words when the next word is predicted, which improves the accuracy of word prediction. To verify its effectiveness, the VALRT model was tested on the MSCOCO benchmark dataset. The experimental results show that the VALRT model achieves better performance than other models and generates more accurate, more fine-grained descriptions.
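The abstract does not state the fusion formula used by the VR module, so the following is only a minimal sketch of what fusing encoder features stepwise, from the lowest layer to the highest, could look like. It substitutes a plain fixed-weight running blend for whatever learned operation the paper uses; the names `blend` and `stepwise_fuse` and the `weight` parameter are hypothetical and do not come from the paper.

```python
from typing import List

# One row per image region, one column per feature channel.
Matrix = List[List[float]]


def blend(low: Matrix, high: Matrix, weight: float) -> Matrix:
    """Element-wise convex combination: weight * low + (1 - weight) * high."""
    if len(low) != len(high) or any(len(a) != len(b) for a, b in zip(low, high)):
        raise ValueError("feature maps must have the same shape")
    return [
        [weight * a + (1.0 - weight) * b for a, b in zip(row_low, row_high)]
        for row_low, row_high in zip(low, high)
    ]


def stepwise_fuse(layer_outputs: List[Matrix], weight: float = 0.5) -> Matrix:
    """Fuse encoder layer outputs from lowest to highest, one step per layer.

    Assumption: a fixed blend weight stands in for the learned fusion
    (for example attention or gating) that a real model would use.
    """
    if not layer_outputs:
        raise ValueError("need at least one encoder layer output")
    fused = layer_outputs[0]
    for higher in layer_outputs[1:]:
        # Each step carries the accumulated lower-level features upward.
        fused = blend(fused, higher, weight)
    return fused


# Toy input: 3 encoder layers, 2 regions, 2 channels each.
layers = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
fused = stepwise_fuse(layers, weight=0.5)
print(fused)
```

With the toy input, the first step blends layers 1 and 2 into 2.0 per entry, and the second step blends that result with layer 3 into 3.5, so the low-level features still contribute to the final representation instead of being discarded.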

Keywords: image captioning; Transformer; deep learning; attention mechanism; multimodal; encoder; decoder

Classification No.: TP391.4 [Automation and Computer Technology: Computer Application Technology]
