Affiliations: [1] School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300380, China; [2] School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China; [3] School of Computer Science and Engineering, Macau University of Science and Technology, Macau 999078, China
Source: Journal of Image and Graphics (中国图象图形学报), 2025, Issue 3, pp. 842-854 (13 pages)
Funding: National Natural Science Foundation of China (62376196, 62036012, U23A20387, 62106262, 62202331, 62206200, 62276118, 62376037); Natural Science Foundation of Tianjin (24JCJQJC00190, 22JCYBJC00030).
Abstract:

Objective: Weakly-supervised temporal action localization (WTAL) locates the start and end times of action instances and identifies their categories using only video-level annotations. Because only video-level labels are available, a loss function cannot be designed directly for the localization task, so existing work generally adopts a "localization by classification" strategy and trains with multiple instance learning. This strategy has two limitations: 1) localization and classification are different tasks with a notable gap between them, so localizing from classification results may degrade final performance; 2) under weak supervision, the fine-grained supervisory information needed to distinguish actions from backgrounds in videos is lacking, which poses a considerable challenge for localization. Vision-language models have recently received extensive attention; they model the correspondence between images and text for more comprehensive visual perception, and specific textual prompts can improve their performance and robustness when large models are applied to downstream tasks. Current vision-language-based methods therefore use auxiliary textual prompt information to compensate for the missing supervision and improve the performance and robustness of temporal action localization models. In these models, action label text is typically encapsulated as textual prompts, which fall into two types: handcrafted prompts and learnable prompts. Handcrafted prompts consist of a fixed template filled with the action label (e.g., "a video of {class}"); they learn more generalized knowledge of the action class but lack knowledge specific to the action. Learnable prompts consist of a set of learnable vectors that are adjusted and optimized during training, and can thus learn more specific knowledge. The two types of prompts are complementary, but existing methods ignore this complementarity, so the introduced textual prompts cannot fully play their guiding role. To address this, a multi-type prompts complementary model for weakly-supervised temporal action localization is proposed.

Method: First, a prompt interaction module is designed in which each type of textual prompt interacts with the video separately, and attention weighting yields feature information at different scales. Second, to model the correspondence between text and video, a segment-level contrastive loss constrains the matching between textual prompts and action segments. Finally, a threshold filtering module filters and compares the scores of multiple class activation sequences (CAS) to enhance the discriminability of action categories.

Result: The method is compared with related work on three representative datasets: THUMOS14, ActivityNet1.2, and ActivityNet1.3. It achieves 39.1% mean average precision (mAP) (0.1:0.7) on THUMOS14 and 27.3% mAP (0.5:0.95) on ActivityNet1.2, improvements of 1.1% and 1% over the P-MIL (proposal-based multiple instance learning) method, respectively. On ActivityNet1.3 it achieves mAP (0.5:0.95) comparable to competing work, with an average mAP of 26.7%.

Conclusion: The proposed temporal action localization model exploits the complementarity of the two types of textual prompts to guide localization; the proposed threshold filtering module makes the most of the advantages of both prompt types, maximizing their auxiliary role and yielding more accurate localization results.
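To make the two prompt types concrete, the following PyTorch sketch (all names and shapes are illustrative assumptions, not the paper's implementation) builds a handcrafted prompt from a fixed template and a CoOp-style learnable prompt from trainable context vectors:

import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """CoOp-style prompt sketch: n_ctx trainable context vectors prepended to
    the class-name token embeddings (names are hypothetical, not the paper's)."""
    def __init__(self, n_ctx: int, embed_dim: int, class_embeds: torch.Tensor):
        super().__init__()
        # Trainable context vectors, adjusted and optimized during training.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # Frozen embeddings of the class-name tokens: (n_cls, n_name_tok, d).
        self.register_buffer("class_embeds", class_embeds)

    def forward(self) -> torch.Tensor:
        n_cls = self.class_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)   # (n_cls, n_ctx, d)
        return torch.cat([ctx, self.class_embeds], dim=1)   # (n_cls, n_ctx + n_name_tok, d)

def handcrafted_prompts(class_names):
    # Fixed template carrying generic knowledge of the action class.
    return [f"a video of {name}" for name in class_names]

# Usage: the strings go through a frozen text encoder, while the learnable
# prompt is fed to the same encoder directly as token embeddings.
names = ["high jump", "diving"]
print(handcrafted_prompts(names))
prompt = LearnablePrompt(n_ctx=8, embed_dim=512,
                         class_embeds=torch.randn(len(names), 4, 512))
print(prompt().shape)  # torch.Size([2, 12, 512])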
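The prompt interaction module is described only at a high level; one minimal reading is a cross-attention in which text prompts attend over video snippet features and the resulting attention map also re-weights the snippets. Everything below is an assumed sketch, not the paper's architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptVideoInteraction(nn.Module):
    """Hypothetical interaction: text prompts (queries) attend over video
    snippets (keys/values); the attention map re-weights the snippets."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, prompts: torch.Tensor, video: torch.Tensor):
        # prompts: (n_cls, d) text embeddings; video: (T, d) snippet features.
        attn = F.softmax(self.q(prompts) @ self.k(video).t() * self.scale, dim=-1)
        fused_text = attn @ self.v(video)                   # (n_cls, d) prompt-conditioned features
        snippet_weight = attn.sum(dim=0, keepdim=True).t()  # (T, 1) total attention per snippet
        weighted_video = video * snippet_weight             # attention-weighted snippets
        return fused_text, weighted_video

# Each prompt type (handcrafted / learnable) would run through its own branch,
# producing feature information at a different granularity.
inter = PromptVideoInteraction(dim=512)
fused, weighted = inter(torch.randn(20, 512), torch.randn(128, 512))
print(fused.shape, weighted.shape)  # torch.Size([20, 512]) torch.Size([128, 512])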
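The segment-level contrastive loss can likewise be sketched only in spirit: an InfoNCE-style objective that pulls each action segment's feature toward the prompt embedding of its video-level class and pushes it away from other classes. The pairing rule and temperature below are assumptions:

import torch
import torch.nn.functional as F

def segment_contrastive_loss(seg_feats, text_feats, labels, tau: float = 0.07):
    """InfoNCE-style sketch (assumed form, not the paper's exact loss).
    seg_feats:  (n_seg, d) features of candidate action segments
    text_feats: (n_cls, d) prompt embeddings, one per class
    labels:     (n_seg,)   video-level class index assigned to each segment
    """
    seg = F.normalize(seg_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = seg @ txt.t() / tau  # (n_seg, n_cls) scaled cosine similarities
    # Cross-entropy over classes = InfoNCE with the matched prompt as positive.
    return F.cross_entropy(logits, labels)

loss = segment_contrastive_loss(torch.randn(32, 512),
                                torch.randn(20, 512),
                                torch.randint(0, 20, (32,)))
print(loss.item())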
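Finally, one possible reading of the threshold filtering module: given one class activation sequence (CAS) per prompt branch, suppress scores below a threshold and fuse the survivors across branches, here by element-wise maximum. The threshold value and max-fusion are assumptions for illustration:

import torch

def threshold_filter_cas(cas_list, thresh: float = 0.5):
    """Filter and compare scores across multiple CAS (assumed scheme).
    cas_list: list of (T, n_cls) activation sequences, one per prompt branch.
    Returns a fused (T, n_cls) CAS with more discriminative class scores.
    """
    stacked = torch.stack(cas_list)               # (n_branch, T, n_cls)
    kept = torch.where(stacked >= thresh, stacked,
                       torch.zeros_like(stacked)) # suppress low-confidence scores
    return kept.max(dim=0).values                 # keep the strongest surviving branch

cas_handcrafted = torch.rand(128, 20)  # CAS from the handcrafted-prompt branch
cas_learnable = torch.rand(128, 20)    # CAS from the learnable-prompt branch
fused = threshold_filter_cas([cas_handcrafted, cas_learnable])
print(fused.shape)  # torch.Size([128, 20])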
Keywords: weakly-supervised temporal action localization (WTAL); vision-language model; handcrafted prompts; learnable prompts; class activation sequence (CAS)
CLC Number: TP391 [Automation and Computer Technology - Computer Application Technology]