A Dual Stream Multimodal Alignment and Fusion Network for Classifying Short Videos

Chinese title: 基于双流多模态对齐和融合的短视频分类网络

Authors: ZHOU Ming (周明), WANG Tong (王彤)

Affiliations: [1] College of Information Science and Technology, Donghua University, Shanghai 201620, China; [2] Engineering Research Center of Digitized Textile & Apparel Technology, Ministry of Education, Donghua University, Shanghai 201620, China

Source: Journal of Donghua University (English Edition) (东华大学学报(英文版)), 2025, No. 1, pp. 88-95 (8 pages)

Funding: Fundamental Research Funds for the Central Universities, China (No. 2232021A-10); National Natural Science Foundation of China (No. 61903078); Shanghai Sailing Program, China (No. 22YF1401300); Natural Science Foundation of Shanghai, China (No. 20ZR1400400).

Abstract: Video classification is an important task in video understanding and plays a pivotal role in the intelligent monitoring of information content. Most existing methods do not consider the multimodal nature of video, and their modality fusion tends to be too simple, often neglecting modality alignment before fusion. This research introduces a novel dual stream multimodal alignment and fusion network, named DMAFNet, for classifying short videos. The network uses two unimodal encoder modules to extract features within each modality and a multimodal encoder module to learn the interaction between modalities. To solve the modality alignment problem, contrastive learning is introduced between the two unimodal encoder modules. Additionally, masked language modeling (MLM) and video text matching (VTM) auxiliary tasks are introduced to improve the interaction between the video frame and text modalities through backpropagation of their loss functions. Experiments demonstrate the effectiveness of DMAFNet in multimodal video classification tasks. Compared with two other mainstream baselines, DMAFNet achieves the best results on the 2022 WeChat Big Data Challenge dataset.
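The abstract does not give the form of the contrastive objective used between the two unimodal encoders. As an illustration only, the sketch below shows one common formulation, a symmetric InfoNCE loss over a batch of paired video and text embeddings. The function names, the temperature value of 0.07, and the use of NumPy are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    video_emb, text_emb: arrays of shape (batch, dim); row i of each array
    is assumed to come from the same short video, so (i, i) is the positive
    pair and every other pair in the batch is treated as a negative.
    """
    v = l2_normalize(np.asarray(video_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))

    # Pairwise cosine similarities, sharpened by the temperature (assumed value).
    logits = v @ t.T / temperature
    idx = np.arange(logits.shape[0])

    # Video-to-text: each row should put its probability mass on the diagonal.
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-video: the same requirement, column-wise.
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 8))
    print("aligned pairs:   ", contrastive_alignment_loss(emb, emb))
    print("misaligned pairs:", contrastive_alignment_loss(emb, emb[::-1]))
```

In this formulation the loss is small when each video embedding is closest to its own text embedding and large when the pairing is broken. In a network such as the one described, a term of this kind would presumably be combined with the MLM, VTM, and classification losses, but the weighting is not stated in the abstract.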

Keywords: video classification; multimodal fusion; feature alignment

Classification code: TP751.1 [Automation and Computer Technology: Detection Technology and Automation Devices]
