A Stylized Image Caption Approach Based on Cross-Media Disentangled Representation Learning  (Cited by: 1)


Authors: LIN Ze-Hao; LI Guo-Dun; ZENG Xiang-Ji; DENG Yue; ZHANG Yin[1]; ZHUANG Yue-Ting[1] (Department of Computer Science and Technology, Zhejiang University, Hangzhou 310027)

Affiliation: [1] Department of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Source: Chinese Journal of Computers (《计算机学报》), 2022, No. 12, pp. 2510-2527 (18 pages)

Funding: National Natural Science Foundation of China (62072399, 61402403, U19B2042); China Knowledge Centre for Engineering Sciences and Technology; Engineering Research Center of Digital Library, Ministry of Education; China Engineering Science and Technology Data and Knowledge Technology Research Center; the Fundamental Research Funds for the Central Universities; and the Baidu AI research fund.

Abstract: Stylized image captioning requires the generated text not only to be semantically consistent with a given image, but also to conform to a given linguistic style. With advances in neural network techniques for computer vision and natural language generation, recent research on this topic has made remarkable progress. However, a neural network model is a black-box system, and it remains difficult for humans to understand the styles and facts represented by the parameters in its latent space, or the relationships between them. To improve the understanding of the factual content and linguistic style attributes contained in the latent space, to strengthen control over both, and to enhance the controllability and interpretability of neural networks, this paper proposes a novel stylized image caption model based on disentanglement techniques, Disentangled Stylized Image Caption (DSIC). The model learns disentangled representations from images and captions in a non-parallel manner, using two disentangled representation learning modules, D-Images and D-Captions, to learn the disentangled factual and style information in images and image captions, respectively. At inference time, DSIC uses a caption decoder together with a specially designed capsule-network-based information aggregation method to fully exploit the previously learned cross-media representations, and generates captions in the target style by directly manipulating the latent vectors. Experiments were conducted on the SentiCap and FlickrStyle10K datasets. The disentangled representation learning results demonstrate the effectiveness of the model's disentanglement, while the stylized image captioning results show that the aggregated cross-media disentangled representations yield better stylized captioning performance: compared with competing stylized image caption models, the proposed method improves performance by 17% to 86% on multiple metrics.

The task of stylized image caption aims to generate a natural language description that is semantically related to a given image and consistent with a given linguistic style. Both requirements make this task significantly more difficult than the traditional image caption task. However, with the availability of large-scale image-text corpora and advances in deep learning techniques of computer vision and natural language processing, stylized image caption research has made significant advances in recent years. Widely adopted neural networks have demonstrated their powerful abilities to handle the complexities and challenges of the stylized image caption task. A typical stylized image caption model is usually an encoder-decoder architecture. The model inputs go through many layers of non-linear transformations, e.g. the ReLU layers in Convolutional Neural Networks (CNNs), to yield latent representations. This makes the latent representations and parameters of the model lack interpretability and controllability, which can restrict the understanding of this task and its further improvement. In this paper, we focus on the problem of understanding and controlling the latent representations of linguistic style and factual content in stylized image caption models by learning disentangled representations. Existing disentanglement methods mainly work on single-modal data, such as computer vision or natural language processing. However, in stylized image caption, there are two types of media, images and texts, involved to learn a representation that is faithful to the underlying data structure. How to disentangle the latent space of cross-media data still needs to be explored. Inspired by the successful applications of disentangled representation learning in Computer Vision and Natural Language Processing, we propose a novel approach, Disentangled Stylized Image Caption (DSIC), to learn disentangled representations on non-parallel cross-media data. With the help of the VAE framework, two latent space filter modules, style filter and fact filter, are desi
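The abstract describes filtering a VAE latent space into separate style and fact representations. The following is a minimal numpy sketch of that generic idea only: a linear VAE-style encoder with the reparameterization trick, whose latent vector is partitioned into a style part and a fact part. All function names, weights, and dimensions here are illustrative assumptions, not the paper's actual DSIC architecture or settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    # Linear encoder producing the mean and log-variance of the latent code.
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar, rng):
    # VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def split_latent(z, style_dim):
    # Partition the latent vector into a style part and a fact part,
    # loosely mirroring the style-filter / fact-filter idea above.
    return z[:, :style_dim], z[:, style_dim:]

# Toy dimensions (hypothetical, chosen only for illustration).
x = rng.standard_normal((4, 16))            # batch of 4 encoded inputs
W_mu = rng.standard_normal((16, 8))         # latent dimension 8
W_logvar = rng.standard_normal((16, 8)) * 0.01

mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)
z_style, z_fact = split_latent(z, style_dim=2)
```

In a full model, the decoder would consume `z_fact` together with a chosen `z_style` to generate a caption, so swapping `z_style` changes the linguistic style while the factual content is held fixed.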

Keywords: cross-media; machine learning; disentangled representation learning; stylized image caption generation; natural language generation

Classification: TP391 [Automation and Computer Technology - Computer Application Technology]
