Authors: ZHOU Xu [1]; QIAN Sheng-sheng; LI Zhang-ming; FANG Quan [2]; XU Chang-sheng [2]
Affiliations: [1] Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450000, China; [2] National Key Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Source: Computer Science (《计算机科学》), 2022, No. 9, pp. 132-138 (7 pages)
Fund: National Natural Science Foundation of China (61936005)
Abstract: The rapid development of the Internet and the continuous expansion of social media have brought a wealth of social event information, making the task of social event classification increasingly challenging. Making full use of image-level and text-level information is the key to social event classification. However, most existing methods have the following limitations: 1) most existing multi-modal methods rest on an ideal assumption that the samples of each modality are sufficient and complete, but in real applications this assumption does not always hold, and a certain modality of an event may be missing; 2) most methods simply concatenate the image features and text features of social events to obtain multi-modal features for classification, ignoring the semantic gap between modalities. To address these challenges, this paper proposes a dual variational multi-modal attention network (DVMAN) that can handle both complete and incomplete social event classification. In the DVMAN network, a novel dual variational autoencoder network is proposed to generate common representations of social events and to reconstruct the modal information missing in incomplete social event learning. Through distribution alignment and cross-reconstruction alignment, the image and text latent representations are doubly aligned to narrow the gap between modalities, and the latent representations of missing modal information are synthesized by reconstruction. In addition, a multi-modal fusion module is designed to integrate the fine-grained image and text information of social events, so that information from the two modalities complements and reinforces each other. Extensive experiments on two publicly available event datasets show that DVMAN improves accuracy by more than 4% over existing state-of-the-art methods, demonstrating its superior performance for social event classification.
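The abstract's core mechanism — two modality-specific variational encoders whose latent Gaussians are pulled together by distribution alignment, plus cross-reconstruction where each modality is decoded from the other's latent code — can be sketched numerically. The sketch below is an illustration only, not the paper's implementation: the linear encoders/decoders, dimensions, and loss weighting are all hypothetical stand-ins for the actual DVMAN networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper does not state them.
D_IMG, D_TXT, D_Z = 8, 6, 4

# Linear maps stand in for the paper's variational encoder/decoder networks.
W_enc_img = rng.normal(size=(D_IMG, 2 * D_Z)) * 0.1
W_enc_txt = rng.normal(size=(D_TXT, 2 * D_Z)) * 0.1
W_dec_img = rng.normal(size=(D_Z, D_IMG)) * 0.1
W_dec_txt = rng.normal(size=(D_Z, D_TXT)) * 0.1

def encode(x, W):
    """Map an input to diagonal-Gaussian latent parameters (mu, logvar)."""
    h = x @ W
    return h[:D_Z], h[D_Z:]

def kl_between(mu1, lv1, mu2, lv2):
    """KL(N1 || N2) for diagonal Gaussians -- the distribution-alignment term."""
    return 0.5 * np.sum(
        lv2 - lv1 + (np.exp(lv1) + (mu1 - mu2) ** 2) / np.exp(lv2) - 1.0
    )

x_img = rng.normal(size=D_IMG)
x_txt = rng.normal(size=D_TXT)

mu_i, lv_i = encode(x_img, W_enc_img)
mu_t, lv_t = encode(x_txt, W_enc_txt)

# Reparameterised samples from each modality's latent distribution.
z_i = mu_i + np.exp(0.5 * lv_i) * rng.normal(size=D_Z)
z_t = mu_t + np.exp(0.5 * lv_t) * rng.normal(size=D_Z)

# Distribution alignment: pull the two latent Gaussians together.
loss_align = kl_between(mu_i, lv_i, mu_t, lv_t)

# Cross-reconstruction alignment: decode each modality from the OTHER latent.
loss_cross = (np.mean((z_t @ W_dec_img - x_img) ** 2)
              + np.mean((z_i @ W_dec_txt - x_txt) ** 2))

# If the text modality were missing, its latent representation could be
# synthesized from the aligned image latent (the idea behind handling
# incomplete events); taking mu_i directly is one simple choice.
z_txt_synth = mu_i

total = loss_align + loss_cross
```

Minimizing `loss_align + loss_cross` (together with the usual per-modality VAE reconstruction terms, omitted here) is what lets an aligned latent from one modality substitute for a missing one at classification time.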
CLC number: TP391 (Automation and Computer Technology / Computer Application Technology)