Authors: Cheng Yong (程勇) [1]; Gao Yuanyuan (高园元); Wang Jun (王军) [1]; Yang Ling (杨玲) [1]; Xu Xiaolong (许小龙) [1]; Cheng Yao (程遥); Zhang Kaihua (张开华)
Affiliations: [1] School of Software, Nanjing University of Information Science and Technology, Nanjing 210044, China; [2] School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
Source: Journal of Image and Graphics (《中国图象图形学报》), 2025, No. 2, pp. 406-420 (15 pages)
Funding: National Natural Science Foundation of China (41975183, 41875184)
Abstract:
Objective: Spatial-temporal action detection (STAD) aims to predict the spatial-temporal location and class of every action in a video clip and represents a significant challenge in video understanding. Most existing methods focus on the actors' visual and motion features and rely on the backbone for feature modeling of video clips, which captures only local features and ignores the global contextual information of the actors' interactions; as a result, the model cannot fully comprehend the nuances of the entire scene. The current mainstream methods for real-time STAD are based on dual-stream networks. However, a simple channel-by-channel concatenation is typically employed for dual-branch fusion, which leaves significant redundancy in the fused features and semantic differences between the branch features, degrading accuracy. To address these shortcomings, an efficient STAD model combining dilated convolution with multiscale fusion, the efficient action detector (EAD), is proposed.
Method: EAD consists of three key components: a 2D branch, a 3D branch, and a fusion head. The 2D branch comprises a pretrained 2D backbone, a feature pyramid, and a decoupling head; the 3D branch comprises a 3D backbone and an augmentation module; the fusion head comprises a multiscale feature fusion unit (MSFFU) and a prediction head. First, a lightweight dual-branch network simultaneously models the static information of key frames and the dynamic spatial-temporal information of the clip: key frames are fed into the pretrained 2D backbone (YOLOv7) to detect the actors and obtain spatially decoupled classification and localization features, while spatial-temporal features are extracted from the clip by a pretrained lightweight video backbone (ShuffleNetV2). Second, a lightweight spatial dilated augmented module (LSDAM), built on a grouping strategy to save resources, extracts global contextual information; LSDAM consists of a dilated module (DM) and a spatial augmented module. Then, a multiscale feature fusion unit composed of several DO-Conv structures captures and fuses multiscale features. Finally, the features of different levels are fed into prediction heads for detection.
Result: Experiments on the UCF101-24 and AVA (atomic visual actions) datasets compare EAD with existing algorithms. On UCF101-24, EAD reaches a frame-mAP of 80.93% and a video-mAP of 50.41%, reducing the missed and false detections of the baseline; on AVA, it reaches a frame-mAP of 15.92% while keeping the computational overhead low.
Conclusion: Compared with the baseline and current mainstream methods, EAD models global key information at low computational cost and improves the accuracy of real-time action detection.
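The method description lends itself to a short sketch. The PyTorch code below illustrates the two ideas at the core of EAD's design: a grouped dilated module that enlarges the receptive field cheaply (in the spirit of LSDAM's grouping strategy) and a DO-Conv-based fusion unit that replaces plain channel concatenation. This is a minimal sketch under stated assumptions, not the authors' implementation; the module names (GroupedDilatedModule, DOConv2d, MultiScaleFusionUnit), channel sizes, and dilation rates are all illustrative, since the abstract does not give exact configurations.

```python
# Illustrative sketch of a grouped dilated context module and a DO-Conv-style
# fusion unit. Names, channel sizes, and dilation rates are assumptions.
import torch
import torch.nn as nn


class GroupedDilatedModule(nn.Module):
    """Split channels into groups and give each group a different dilation
    rate, so the receptive field grows without extra parameters (an
    assumption-level sketch of the paper's LSDAM dilated module)."""

    def __init__(self, channels: int, dilations=(1, 2, 3, 4)):
        super().__init__()
        assert channels % len(dilations) == 0
        group_ch = channels // len(dilations)
        self.branches = nn.ModuleList(
            nn.Conv2d(group_ch, group_ch, kernel_size=3,
                      padding=d, dilation=d, bias=False)
            for d in dilations
        )
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split along channels, run each chunk through its dilated conv,
        # then concatenate and mix with a 1x1 convolution plus a residual.
        chunks = torch.chunk(x, len(self.branches), dim=1)
        out = torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)
        return self.act(self.norm(self.fuse(out)) + x)


class DOConv2d(nn.Module):
    """Simplified DO-Conv-style layer: a depthwise convolution composed with
    a standard convolution. The pair is linear, so at inference it could be
    folded into one kernel; this sketch keeps the two stages for clarity."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=pad, groups=in_ch, bias=False)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=pad, bias=False)

    def forward(self, x):
        return self.conv(self.depthwise(x))


class MultiScaleFusionUnit(nn.Module):
    """Fuse 2D-branch (key-frame) and 3D-branch (clip) features with
    DO-Conv blocks instead of plain channel-by-channel concatenation."""

    def __init__(self, ch_2d: int, ch_3d: int, out_ch: int):
        super().__init__()
        self.reduce = DOConv2d(ch_2d + ch_3d, out_ch)
        self.refine = nn.Sequential(DOConv2d(out_ch, out_ch),
                                    nn.BatchNorm2d(out_ch),
                                    nn.ReLU(inplace=True))

    def forward(self, f2d, f3d):
        return self.refine(self.reduce(torch.cat([f2d, f3d], dim=1)))


if __name__ == "__main__":
    # Toy shapes: key-frame features (2D branch) and temporally pooled clip
    # features (3D branch) at the same spatial resolution.
    f2d = torch.randn(1, 128, 28, 28)
    f3d = torch.randn(1, 256, 28, 28)
    context = GroupedDilatedModule(256)(f3d)           # global context
    fused = MultiScaleFusionUnit(128, 256, 256)(f2d, context)
    print(fused.shape)  # torch.Size([1, 256, 28, 28])
```

The grouping design keeps the parameter and FLOP count of the dilated stage close to that of a single 3x3 convolution while mixing several receptive-field sizes, which matches the abstract's claim of capturing global context at low computational cost.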
Keywords: deep learning; spatial-temporal action detection (STAD); dual-branch network; dilated augmented module (DAM); multiscale fusion
CLC number: TP391.41 [Automation and Computer Technology / Computer Application Technology]