Authors: DANG Weichao [1]; FAN Yinghao; GAO Gaimei [1]; LIU Chunxia [1]
Affiliation: [1] College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, Shanxi 030024, China
Source: Journal of Computer Applications, 2025, No. 3, pp. 963-971 (9 pages)
Funding: Shanxi Provincial Natural Science Foundation (202203021211194); Doctoral Research Start-up Fund of Taiyuan University of Science and Technology (20202063); Graduate Education Innovation Project of Taiyuan University of Science and Technology (SY2022063).
Abstract: To address the inaccurate action classification and localization caused by treating video clips as independent action instances in existing weakly supervised action localization studies, a weakly supervised action localization method integrating temporal and global contextual feature enhancement was proposed. Firstly, a temporal feature enhancement branch was constructed, using dilated convolution to enlarge the receptive field and introducing an attention mechanism to capture temporal dependencies between video clips. Secondly, an Expectation-Maximization (EM) algorithm based on a Gaussian Mixture Model (GMM) was designed to capture video context information; at the same time, binary walk propagation was used for global contextual feature enhancement, generating high-quality Temporal Class Activation Maps (TCAMs) as pseudo labels to supervise the temporal feature enhancement branch online. Thirdly, a momentum update network was used to obtain a cross-video dictionary reflecting action features across videos. Finally, cross-video contrastive learning was applied to improve the accuracy of action classification. Experimental results show that, at an Intersection-over-Union (IoU) threshold of 0.5, the proposed method achieves mean Average Precision (mAP) of 42.0% and 42.2% on the THUMOS'14 and ActivityNet v1.3 datasets, respectively; compared with CCKEE (Cross-video Contextual Knowledge Exploration and Exploitation), the mAP is improved by 2.6 and 0.6 percentage points, respectively, verifying the effectiveness of the proposed method.
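The abstract's second step fits a Gaussian Mixture Model to clip features via Expectation-Maximization to capture video context. As a minimal illustrative sketch only (the paper's actual component count, initialization, and convergence criteria are not given in the abstract, and all names here are hypothetical), EM over isotropic Gaussians on per-clip feature vectors looks like:

```python
import numpy as np

def gmm_em(X, k=2, iters=50, seed=0):
    """Toy EM for a GMM over clip features X of shape (n_clips, dim).
    Illustrative sketch only; not the paper's exact formulation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]   # component means, init from data
    var = np.full(k, X.var() + 1e-6)               # isotropic variances
    pi = np.full(k, 1.0 / k)                       # mixing weights
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = p(component j | clip i)
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)             # (n, k)
        logp = -0.5 * (d2 / var + d * np.log(2 * np.pi * var)) + np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)                    # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        nk = r.sum(axis=0)
        mu = (r.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (nk * d) + 1e-6
        pi = nk / n
    return r, mu
```

The soft responsibilities play the role of context assignments: clips assigned to the same component share contextual statistics, which is the kind of grouping the method exploits before propagating global context.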
Keywords: weakly supervised action localization; temporal class activation map; momentum update; pseudo label supervision; feature enhancement
Classification Code: TP391.4 [Automation and Computer Technology - Computer Application Technology]