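Authors: ZHOU Ming, WANG Tong (周明; 王彤) (College of Information Science and Technology, Donghua University, Shanghai 201620; Engineering Research Center of Digitized Textile & Apparel Technology, Ministry of Education, Donghua University, Shanghai 201620)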
Affiliations: [1] College of Information Science and Technology, Donghua University, Shanghai 201620, China [2] Engineering Research Center of Digitized Textile & Apparel Technology, Ministry of Education, Donghua University, Shanghai 201620, China
Source: Journal of Donghua University (English Edition) (东华大学学报(英文版)), 2025, No. 1, pp. 88-95 (8 pages)
Funding: Fundamental Research Funds for the Central Universities, China (No. 2232021A-10); National Natural Science Foundation of China (No. 61903078); Shanghai Sailing Program, China (No. 22YF1401300); Natural Science Foundation of Shanghai, China (No. 20ZR1400400).
Abstract: Video classification is an important task in video understanding and plays a pivotal role in intelligent monitoring of information content. Most existing methods do not consider the multimodal nature of video, and their modality fusion approaches tend to be too simple, often neglecting modality alignment before fusion. This research introduces a novel dual-stream multimodal alignment and fusion network, named DMAFNet, for classifying short videos. The network uses two unimodal encoder modules to extract features within each modality and a multimodal encoder module to learn the interaction between modalities. To solve the modality alignment problem, contrastive learning is introduced between the two unimodal encoder modules. Additionally, masked language modeling (MLM) and video-text matching (VTM) auxiliary tasks are introduced to improve the interaction between the video-frame and text modalities through backpropagation of their loss functions. Experiments demonstrate the effectiveness of DMAFNet in multimodal video classification tasks. Compared with two other mainstream baselines, DMAFNet achieves the best results on the 2022 WeChat Big Data Challenge dataset.
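The abstract does not give the form of the contrastive objective used between the two unimodal encoders. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style loss over a batch of paired video and text feature vectors, which is a common choice for this kind of alignment and not necessarily what DMAFNet uses. The function names, the temperature value of 0.07, and the plain-list feature representation are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_diag(logits):
    """For each row i, return the log-softmax value at column i."""
    out = []
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        out.append(row[i] - lse)
    return out


def contrastive_alignment_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed form, not taken from the paper).

    video_feats[i] and text_feats[i] are treated as a matching pair;
    every other combination in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in video_feats]
    t = [l2_normalize(x) for x in text_feats]
    # Cosine similarity matrix scaled by the (hypothetical) temperature.
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
           for vi in v]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-video direction
    v2t = -sum(log_softmax_diag(sim)) / len(v)
    t2v = -sum(log_softmax_diag(sim_t)) / len(t)
    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_alignment_loss(video, aligned_text))
    print(contrastive_alignment_loss(video, shuffled_text))
```

Under this assumed loss, a batch whose video and text features already point in the same directions yields a loss near zero, while a batch with mismatched pairs is penalised heavily, which is the pressure that pulls the two encoders' feature spaces into alignment before fusion.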
Keywords: video classification; multimodal fusion; feature alignment
Classification number: TP751.1 [Automation and Computer Technology: Detection Technology and Automatic Devices]