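Affiliations: [1] School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601 [2] School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027
Source: Chinese Journal of Computers (《计算机学报》), 2021, No. 3, pp. 476-490 (15 pages)
Funding: National Science Fund for Distinguished Young Scholars (61325010); National Natural Science Foundation of China (61403358); Fundamental Research Funds for the Central Universities.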
Abstract (translated from the Chinese): In recent years, image-text modeling has become an important research direction in natural language processing. Images are often used to enhance the semantic understanding and representation of sentences. However, some researchers have questioned whether image information is necessary for sentence semantic understanding, because text alone provides strong prior knowledge that helps models achieve very good results, and models can even produce correct answers without using images. Research on image-text modeling therefore first needs to answer one question: do images help sentence semantic understanding and representation? To this end, this paper selects a typical natural language semantic understanding task that involves no images, natural language inference, and introduces image information into the task to verify its effectiveness. Because natural language inference is a pure natural language task whose data annotation did not take image information into account, choosing it allows a more objective analysis of the influence of image information on sentence semantic understanding and representation. Specifically, this paper proposes a general plug and play framework for integrating image information. Based on this framework, the paper selects five state-of-the-art natural language inference models and compares their performance before and after image information is used, as well as under different image processing models and different image settings. Finally, extensive experiments on a large-scale public dataset confirm that images, as additional knowledge, do help sentence semantic understanding and representation. The experiments also confirm that different image processing models and different ways of using images affect overall model performance differently.
Abstract (English, as published): Recently, the Visual-to-Language (V2L) problem has attracted more and more attention and become an important research topic in natural language processing. By utilizing Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and attention mechanisms, researchers have made full use of images and achieved much progress on the V2L problem, especially in natural language semantic understanding. In fact, images are often treated as important auxiliary information to enhance sentence semantic understanding. However, some researchers have questioned the necessity of using images for such enhancement. They argue that textual information already provides a very strong prior that ensures good performance for most semantic understanding models, which are even capable of generating correct answers without considering images in some scenarios. Thus, the first crucial question that V2L research should address is whether image information is really necessary and helpful for sentence semantic understanding and representation. To this end, in this paper, we focus on a typical sentence semantic understanding task without images, Natural Language Inference (NLI), which requires an agent to determine the semantic relation between two sentences. Then, we incorporate images as auxiliary information into the sentence pair to verify their effect. Since NLI is originally a pure natural language task and images are not considered during the whole process of data annotation and sentence semantic modeling, choosing the NLI task for evaluation helps to assess the influence of image information on sentence semantic understanding and representation more objectively. To be specific, we first design a general plug and play framework for image utilization and integration, which consists of four general layers, i.e., Input Embedding Layer, Contextual Encoding Layer, Interaction Layer, and Label Prediction Layer, and two plug and play layers, i.e., Fine-Grained Context-Enhanced Layer and Coarse-Grained Contex
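Keywords: image-text modeling; sentence semantic understanding and representation; image information; plug and play framework; natural language inference
Classification code: TP301 [Automation and Computer Technology: Computer System Architecture]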