Affiliation: [1] College of Electronics and Information Engineering, Tongji University, Shanghai
Source: Computer Science and Application (《计算机科学与应用》), 2023, No. 12, pp. 2432-2446, 15 pages
Abstract: Visual question answering (VQA) is a challenging multimodal task that bridges computer vision and natural language processing. Given an image and a related question, the model must extract the relevant information and give the correct answer. Because images and text belong to different modalities, a serious semantic gap separates them, so aligning information across modalities and narrowing that gap is a central concern of current VQA research. To address the large difference in information granularity between image and text at the multimodal alignment stage of current VQA methods, this paper proposes a question answering model based on visual discretization (PDID: Pixel Discretization and Instance Discretization), assisted by a modal attention mechanism, to achieve cross-modal information and semantic alignment. Image features take the pixel as their smallest unit, whereas text features take the word: a language can build its entire textual semantic space from at most tens of thousands of words, while an image is built from RGB arrays on the scale of hundreds of millions of values. Directly modeling an image at the pixel level is therefore difficult to align with text. The paper applies several forms of image discretization. On the one hand, image pixels are discretized in four ways (color, intensity, texture, and spatial discretization), bringing the number of minimal image primitives to the same order of magnitude as that of text. On the other hand, soft coding of image semantic features discretizes the deep semantic features of the image and aligns them with the word semantics of the text, approaching at the semantic level the amount of semantic information carried by words. In addition, the paper proposes a new visual relation fusion module that captures the interaction between discretized and continuous features within the same modality, providing the model with rich visual features. Self-attention first extracts the correlations among intra-modal features, that is, the global visual relations; channel-spatial separated attention then performs cross-modal combination, giving the locally guided global features a larger representation space and more complementary information. To verify the effectiveness of the method, on VQ…
Keywords: VQA; pixel discretization; semantic discretization; self-attention; cross-modal fusion
Classification No.: TP3 [Automation and Computer Technology - Computer Science and Technology]
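The abstract names color discretization as one of four pixel-level discretizations but does not say how it is carried out. As an illustration of the general idea only, the sketch below assumes uniform quantization of each RGB channel into a few equal-width bins, so every pixel maps to one token from a small vocabulary. The names `color_token` and `discretize_image` and the choice of 4 levels per channel are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_token(pixel: Pixel, levels: int = 4) -> int:
    """Map one RGB pixel to a token id in [0, levels**3).

    Assumption: uniform per-channel quantization; the paper may use
    a different scheme.
    """
    if any(not 0 <= channel <= 255 for channel in pixel):
        raise ValueError("channel values must lie in 0..255")
    # Equal-width binning: 0..255 -> 0..levels-1 for each channel.
    r, g, b = (channel * levels // 256 for channel in pixel)
    # Combine the three bin indices into a single base-`levels` number.
    return (r * levels + g) * levels + b


def discretize_image(image: List[List[Pixel]], levels: int = 4) -> List[List[int]]:
    """Replace every pixel of a row-major image by its color token."""
    return [[color_token(pixel, levels) for pixel in row] for row in image]


# A 2 x 2 toy image: two near-identical reds, one blue, one mixed color.
image = [
    [(255, 0, 0), (250, 5, 3)],
    [(0, 0, 255), (12, 130, 64)],
]
tokens = discretize_image(image)
print(tokens)  # [[48, 48], [3, 9]]
```

With 4 levels per channel the color vocabulary has 4**3 = 64 entries, far fewer than the 256**3 raw RGB combinations, and the two nearly identical reds collapse to the same token. This reduction in the number of distinct primitives is what the abstract describes as approaching the scale of a text vocabulary.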