Authors: DANG Weichao [1]; FAN Yinghao; GAO Gaimei [1]; LIU Chunxia [1]
Affiliation: [1] College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, Shanxi 030024, China
Source: Journal of Computer Applications, 2025, No. 3, pp. 963-971 (9 pages)
Funding: Shanxi Provincial Natural Science Foundation (202203021211194); Doctoral Research Start-up Fund of Taiyuan University of Science and Technology (20202063); Graduate Education Innovation Project of Taiyuan University of Science and Technology (SY2022063).
Abstract: To address the inaccurate action classification and localization caused by treating video clips as independent action instances in existing weakly supervised action localization studies, a weakly supervised action localization method integrating temporal and global contextual feature enhancement was proposed. Firstly, a temporal feature enhancement branch was constructed, using dilated convolution to enlarge the receptive field and introducing an attention mechanism to capture temporal dependencies between video clips. Secondly, an Expectation-Maximization (EM) algorithm based on a Gaussian Mixture Model (GMM) was designed to capture video context information; at the same time, binary walk propagation was used for global contextual feature enhancement, generating high-quality Temporal Class Activation Maps (TCAMs) as pseudo labels to supervise the temporal feature enhancement branch online. Thirdly, a momentum update network was used to obtain a cross-video dictionary reflecting action features across videos. Finally, cross-video contrastive learning was applied to improve the accuracy of action classification. Experimental results show that, at an Intersection-over-Union (IoU) threshold of 0.5, the proposed method achieves mean Average Precision (mAP) of 42.0% and 42.2% on the THUMOS'14 and ActivityNet v1.3 datasets, respectively; compared with CCKEE (Cross-video Contextual Knowledge Exploration and Exploitation), the mAP is improved by 2.6 and 0.6 percentage points, respectively, verifying the effectiveness of the proposed method.
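The abstract's second step fits a Gaussian Mixture Model to clip features via Expectation-Maximization to capture video context. As a minimal illustrative sketch only (the paper's actual component count, initialization, and convergence criteria are not given in the abstract, and all names here are hypothetical), EM over isotropic Gaussians on per-clip feature vectors looks like:

```python
import numpy as np

def gmm_em(X, k=2, iters=50, seed=0):
    """Toy EM for a GMM over clip features X of shape (n_clips, dim).
    Illustrative sketch only; not the paper's exact formulation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]   # component means, init from data
    var = np.full(k, X.var() + 1e-6)               # isotropic variances
    pi = np.full(k, 1.0 / k)                       # mixing weights
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = p(component j | clip i)
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)             # (n, k)
        logp = -0.5 * (d2 / var + d * np.log(2 * np.pi * var)) + np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)                    # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        nk = r.sum(axis=0)
        mu = (r.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (r * d2).sum(axis=0) / (nk * d) + 1e-6
        pi = nk / n
    return r, mu
```

The soft responsibilities play the role of context assignments: clips assigned to the same component share contextual statistics, which is the kind of grouping the method exploits before propagating global context.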
Keywords: weakly supervised action localization; temporal class activation map; momentum update; pseudo label supervision; feature enhancement
Classification Code: TP391.4 [Automation and Computer Technology - Computer Application Technology]