Affiliations: [1] School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China; [2] School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China; [3] Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230088, China; [4] Institutes of Physical Science and Information Technology, Anhui University, Hefei 230601, China; [5] School of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou 310018, China
Source: Science China (Information Sciences), 2024, Issue 10, pp. 207-222 (16 pages)
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62272435, U22A2094); advanced computing resources were provided by the Supercomputing Center of the University of Science and Technology of China (USTC).
Abstract: Video question answering (VideoQA) is a challenging yet important task that requires a joint understanding of low-level video content and high-level textual semantics. Despite the promising progress of existing efforts, recent studies have revealed that current VideoQA models tend to over-rely on superficial correlations rooted in dataset bias while overlooking the key video content, thus leading to unreliable results. Effectively understanding and modeling the temporal and semantic characteristics of a given video for robust VideoQA is crucial but, to our knowledge, has not been well investigated. To fill this research gap, we propose a robust VideoQA framework that can effectively model cross-modality fusion and force the model to focus on the temporal and global content of videos when making a QA decision, instead of exploiting shortcuts in datasets. Specifically, we design a self-supervised contrastive learning objective to contrast positive and negative pairs of multimodal input, where the fused representation of the original multimodal input is enforced to be closer to that of the intervened input produced by video perturbation. We expect the fused representation to focus more on the global context of videos rather than on a few static keyframes. Moreover, we introduce an effective temporal order regularization that enforces the inherent sequential structure of videos in the video representation. We also design a Kullback-Leibler divergence-based perturbation invariance regularization over the predicted answer distribution to improve the robustness of the model against temporal content perturbation of videos. Our method is model-agnostic and readily compatible with various VideoQA backbones. Extensive experimental results and analyses on several public datasets demonstrate the advantage of our method over state-of-the-art methods in terms of both accuracy and robustness.
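The three training objectives named in the abstract (contrastive alignment of fused representations, temporal order regularization, and KL-based perturbation invariance) can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the authors' released code: the function names, the binary order-prediction formulation, and the loss weights are all hypothetical.

```python
# Minimal sketch of the three objectives described in the abstract.
# Everything here (names, the binary order-prediction head, the weights)
# is a hypothetical illustration, not the authors' actual implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(z_orig, z_pos, z_neg, tau=0.07):
    """InfoNCE-style objective: pull the fused representation of the
    original multimodal input (z_orig) toward that of the perturbed,
    i.e., intervened, input (z_pos), and away from negatives (z_neg).
    Shapes: z_orig, z_pos are (B, D); z_neg is (K, D)."""
    z_orig = F.normalize(z_orig, dim=-1)
    z_pos = F.normalize(z_pos, dim=-1)
    z_neg = F.normalize(z_neg, dim=-1)
    pos = torch.exp((z_orig * z_pos).sum(dim=-1) / tau)    # (B,)
    neg = torch.exp(z_orig @ z_neg.t() / tau).sum(dim=-1)  # (B,)
    return -torch.log(pos / (pos + neg)).mean()

def temporal_order_loss(order_logits, is_ordered):
    """One plausible instantiation of temporal order regularization:
    a binary head judges whether a clip's frame features appear in
    their original order or were shuffled (the paper's exact
    formulation may differ)."""
    return F.binary_cross_entropy_with_logits(order_logits,
                                              is_ordered.float())

def kl_invariance_loss(logits_orig, logits_pert):
    """KL-divergence regularization between the answer distributions
    predicted from the original and the temporally perturbed video."""
    log_p_pert = F.log_softmax(logits_pert, dim=-1)
    p_orig = F.softmax(logits_orig, dim=-1)
    return F.kl_div(log_p_pert, p_orig, reduction="batchmean")

# Hypothetical total objective; the QA loss and weights a, b, c stand in
# for whatever the chosen backbone and the paper actually use:
# total = loss_qa + a * contrastive_loss(...) \
#       + b * temporal_order_loss(...) + c * kl_invariance_loss(...)
```

Because each term only consumes representations and logits that any VideoQA backbone already produces, the sketch is consistent with the abstract's claim that the method is model-agnostic.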
Keywords: video question answering; cross-modality fusion; contrastive learning; cross-media reasoning
Classification codes: TP391.41 [Automation & Computer Technology - Computer Application Technology]; TP391.1 [Automation & Computer Technology - Computer Science & Technology]