Diverse Image Captioning Based on Hybrid Global and Sequential Variational Transformer

Cited by: 4

Authors: LIU Bing [1,2]; LI Sui [1,2]; LIU Ming-ming; LIU Hao

Affiliations: [1] School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China; [2] Ministry of Education Engineering Research Center of Mine Digitization, Xuzhou, Jiangsu 221116, China

Source: Acta Electronica Sinica (《电子学报》), 2024, No. 4, pp. 1305-1314 (10 pages)

Funding: National Natural Science Foundation of China (No. 62276266, No. 61801198).

Abstract: Diverse image captioning has become a research hotspot in the field of image captioning. However, existing methods ignore the dependency between global and sequential latent vectors, which seriously limits improvements in captioning performance. To address this problem, this paper proposes a diverse image captioning framework based on a hybrid variational Transformer. First, a hybrid global and sequential conditional variational autoencoder is constructed to model the dependency between global and sequential latent vectors. Second, the evidence lower bound of the hybrid model is derived by maximizing the conditional likelihood, and it serves as the objective function for diverse image captioning. Finally, the Transformer is seamlessly combined with the hybrid variational autoencoder, and the two are jointly optimized to improve the generalization performance of diverse image captioning. Experimental results on the MSCOCO dataset show that, compared with the state-of-the-art baseline methods, when 20 and 100 captions are randomly generated, the diversity metric m-BLEU (mutual overlap-BiLingual Evaluation Understudy) improves by 4.2% and 4.7%, respectively, while the accuracy metric CIDEr (Consensus-based Image Description Evaluation) improves by 4.4% and 15.2%, respectively.

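Keywords: image understanding; image captioning; variational autoencoder; latent embedding; multimodal learning; generative model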

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
