Authors: LIU Bing (刘兵) [1,2]; LI Sui (李穗) [1,2]; LIU Ming-ming (刘明明); LIU Hao (刘浩)
Affiliations: [1] School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China; [2] Ministry of Education Engineering Research Center of Mine Digitization, Xuzhou, Jiangsu 221116, China
Source: Acta Electronica Sinica (《电子学报》), 2024, No. 4, pp. 1305-1314 (10 pages)
Funding: National Natural Science Foundation of China (No. 62276266, No. 61801198).
Abstract: Diverse image captioning has become a research hotspot in the field of image captioning. However, existing methods ignore the dependency between global and sequential latent vectors, which seriously limits improvements in captioning performance. To address this problem, this paper proposes a diverse image captioning framework based on a hybrid variational Transformer. First, a hybrid conditional variational autoencoder over global and sequential latent variables is constructed to model the dependency between global and sequential latent vectors. Second, the evidence lower bound of the hybrid model is derived by maximizing the conditional likelihood, and it serves as the objective function for diverse image captioning. Finally, the Transformer is seamlessly combined with the hybrid conditional variational autoencoder, and the two are jointly optimized to improve the generalization performance of diverse image captioning. Experimental results on the MSCOCO dataset show that, compared with the state-of-the-art baseline methods, when randomly generating 20 and 100 captions, the diversity metric m-BLEU (mutual overlap-BiLingual Evaluation Understudy) improves by 4.2% and 4.7%, respectively, while the accuracy metric CIDEr (Consensus-based Image Description Evaluation) improves by 4.4% and 15.2%, respectively.
Keywords: image understanding; image captioning; variational autoencoder; latent embedding; multimodal learning; generative model
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]