Authors: 白志远 杨智翔 栾鸿康 孙玉宝 BAI Zhi-yuan; YANG Zhi-xiang; LUAN Hong-kang; SUN Yu-bao (School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 210044, China; Jiangsu Key Laboratory of Big Data Analysis Technology, School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 210044, China)
Affiliations: [1] School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 210044, Jiangsu, China; [2] Jiangsu Key Laboratory of Big Data Analysis Technology, School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 210044, Jiangsu, China
Source: Computer Engineering & Science (《计算机工程与科学》), 2023, No. 12, pp. 2186-2196 (11 pages)
Funding: National Natural Science Foundation of China (U2001211, 62276139).
Abstract: Text-to-image generation aims to synthesize realistic images from natural language descriptions and is a cross-modal analysis task involving both text and images. Because generative adversarial networks (GANs) generate realistic images efficiently, they have become the mainstream models for text-to-image generation. However, current methods often train word-level and sentence-level text features separately, under-utilizing the text information, which easily leads to mismatches between the generated image and the text. To address this problem, this paper proposes Union-GAN, a cascaded adversarial image generation model that couples word-level and sentence-level text features, and introduces a text-image joint perception module (Union-Block) at each image generation stage. By combining channel affine transformation with cross-modal attention, Union-Block fully exploits both the word-level semantics and the overall sentence-level semantics of the text, so that the generated images both match the textual description and retain clear structure. The discriminators are jointly optimized by adding spatial attention to each of them, so that the supervisory signal from the text drives the generator to produce more text-relevant images. Compared with several current representative models such as AttnGAN on the CUB-200-2011 dataset, experimental results show that Union-GAN achieves an FID score of 13.67, a 42.9% improvement over AttnGAN, and an IS score of 4.52, an improvement of 0.16.
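The abstract describes the Union-Block as fusing sentence-level semantics through a channel affine transformation and word-level semantics through cross-modal attention. The following PyTorch sketch only illustrates that general idea under assumptions of our own; it is not the authors' implementation, and all class names, dimensions, and fusion choices (UnionBlockSketch, ChannelAffine, WordAttention, the residual fusion) are hypothetical.

```python
# Hypothetical sketch of a text-image joint perception block in the spirit of
# the Union-Block described in the abstract: sentence-level conditioning via a
# channel affine transformation plus word-level conditioning via cross-modal
# attention. Not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAffine(nn.Module):
    """Predict per-channel scale and shift from the sentence embedding."""

    def __init__(self, sent_dim: int, channels: int):
        super().__init__()
        self.gamma = nn.Linear(sent_dim, channels)  # channel-wise scale
        self.beta = nn.Linear(sent_dim, channels)   # channel-wise shift

    def forward(self, feat, sent):
        # feat: (B, C, H, W) image features; sent: (B, sent_dim) sentence embedding
        gamma = self.gamma(sent).unsqueeze(-1).unsqueeze(-1)
        beta = self.beta(sent).unsqueeze(-1).unsqueeze(-1)
        return feat * (1.0 + gamma) + beta


class WordAttention(nn.Module):
    """Cross-modal attention: each spatial location attends over word embeddings."""

    def __init__(self, word_dim: int, channels: int):
        super().__init__()
        self.proj = nn.Linear(word_dim, channels)  # map words into image-feature space

    def forward(self, feat, words):
        # feat: (B, C, H, W); words: (B, L, word_dim)
        b, c, h, w = feat.shape
        words = self.proj(words)                        # (B, L, C)
        query = feat.flatten(2).transpose(1, 2)         # (B, HW, C)
        attn = torch.bmm(query, words.transpose(1, 2))  # (B, HW, L)
        attn = F.softmax(attn / c ** 0.5, dim=-1)
        context = torch.bmm(attn, words)                # (B, HW, C)
        return context.transpose(1, 2).reshape(b, c, h, w)


class UnionBlockSketch(nn.Module):
    """Couples sentence-level (affine) and word-level (attention) conditioning."""

    def __init__(self, channels: int, sent_dim: int, word_dim: int):
        super().__init__()
        self.affine = ChannelAffine(sent_dim, channels)
        self.word_attn = WordAttention(word_dim, channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, feat, sent, words):
        sent_cond = self.affine(feat, sent)      # sentence-level modulation
        word_cond = self.word_attn(feat, words)  # word-level context
        fused = self.fuse(torch.cat([sent_cond, word_cond], dim=1))
        return F.leaky_relu(fused, 0.2) + feat   # residual connection


if __name__ == "__main__":
    block = UnionBlockSketch(channels=64, sent_dim=256, word_dim=256)
    img_feat = torch.randn(2, 64, 16, 16)
    sent_emb = torch.randn(2, 256)
    word_emb = torch.randn(2, 18, 256)
    print(block(img_feat, sent_emb, word_emb).shape)  # torch.Size([2, 64, 16, 16])
```

In this sketch, the residual connection and the 3x3 convolution used to merge the two conditioned feature maps are design choices of the illustration only; the paper's actual fusion inside Union-Block and its discriminator-side spatial attention may differ.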
CLC Number: TP391 [Automation and Computer Technology - Computer Application Technology]