Multi-modal mask Transformer network for social event classification (基于多模态掩码Transformer网络的社会事件分类)


Authors: CHEN Hong (陈宏) [1]; QIAN Shengsheng (钱胜胜); LI Zhangming (李章明); FANG Quan (方全) [2]; XU Changsheng (徐常胜) [2]

Affiliations: [1] Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450000, China; [2] Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Source: Journal of Beijing University of Aeronautics and Astronautics (《北京航空航天大学学报》), 2024, No. 2, pp. 579-587 (9 pages)

Funding: National Natural Science Foundation of China (61832002).

Abstract: The key to multi-modal social event classification is making full and accurate use of the features of both the image and text modalities. However, most existing methods have two limitations: they simply concatenate the image and text features of an event, and irrelevant contextual information in the different modalities causes mutual interference. It is therefore not enough to consider only the relationships between the modalities of multi-modal data; the irrelevant context between modalities (i.e., image regions or words) must also be taken into account. To overcome these limitations, this paper proposes a novel social event classification method based on a multi-modal mask Transformer network (MMTN). An image-text encoding network first learns better representations of the text and images. The resulting image and text representations are then fed into the multi-modal mask Transformer network, which fuses the multi-modal information, models the inter-modal relationships by computing the similarity between the modalities, and masks the irrelevant context between them. Extensive experiments on two benchmark datasets show that the proposed model achieves state-of-the-art performance.
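The abstract describes the masking step only at a high level: similarities between the two modalities are computed, and irrelevant regions or words are masked before fusion. The sketch below shows one way such a step could look. It is not the MMTN implementation. The cosine score, the fixed `threshold`, the residual-style fusion, and the function names `cosine` and `masked_cross_attention` are all assumptions made for illustration, and word and region vectors are assumed to have the same dimension.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def masked_cross_attention(words, regions, threshold=0.0):
    """Let each word vector attend over image-region vectors,
    masking regions whose similarity to the word is below `threshold`.

    Assumptions (not taken from the paper): cosine similarity as the
    score, a hard threshold as the mask rule, and word + context as
    the fusion rule.
    """
    fused = []
    for w in words:
        sims = [cosine(w, r) for r in regions]
        kept = [(s, r) for s, r in zip(sims, regions) if s >= threshold]
        if not kept:
            # Every region was masked: keep the word representation unchanged.
            fused.append(list(w))
            continue
        # Softmax over the unmasked similarities only.
        top = max(s for s, _ in kept)
        exps = [math.exp(s - top) for s, _ in kept]
        total = sum(exps)
        context = [
            sum(e / total * r[i] for e, (_, r) in zip(exps, kept))
            for i in range(len(w))
        ]
        # Residual-style fusion: word vector plus attended image context.
        fused.append([a + c for a, c in zip(w, context)])
    return fused


# Toy input: two 2-d "word" vectors and two 2-d "region" vectors.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [-1.0, 0.0]]
print(masked_cross_attention(words, regions, threshold=0.5))
```

In this toy input the first word keeps only the region it matches, while the second word has no region above the threshold and passes through unchanged. A learned model would typically replace the hard threshold with a mask applied inside the attention layers.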

Keywords: multi-modality; social event classification; social media; representation learning; multi-modal Transformer network

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
