Authors: Hou Yonghong; Zheng Haochun; Gao Jiajun; Ren Yi
Affiliations: [1] School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China; [2] School of Future Technology, Tianjin University, Tianjin 300072, China; [3] Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Source: Journal of Tianjin University (Science and Technology), 2025, No. 1, pp. 91-100 (10 pages)
Funding: National Natural Science Foundation of China (62102422).
Abstract: Zero-shot action recognition (ZSAR) aims to learn knowledge from seen action classes and transfer it to unseen action classes, thereby achieving recognition and classification of unknown action samples. Existing ZSAR models, however, rely on limited training data, which restricts the prior knowledge they can learn and prevents them from mapping visual features accurately onto semantic labels; this is a key factor limiting zero-shot learning performance. To address this issue, this study proposes a ZSAR framework that introduces an external knowledge database and the contrastive language-image pretraining (CLIP) model. The framework exploits the knowledge accumulated by the multimodal CLIP model through self-supervised contrastive learning to expand the prior knowledge available to the ZSAR model, and a temporal encoder is designed to compensate for CLIP's lack of temporal modeling capability. To let the model learn richer semantic features and narrow the semantic gap between visual features and semantic labels, the semantic labels of seen action classes are extended: simple text labels are replaced with more detailed descriptive sentences, enriching the semantic information of the text representations. On this basis, a knowledge database is constructed outside the model, providing additional auxiliary information without increasing the number of model parameters and strengthening the association between visual and text feature representations. Finally, following the standard zero-shot learning protocol, the model is fine-tuned for the ZSAR task to improve its generalization ability. Extensive experiments on two mainstream datasets, HMDB51 and UCF101, show that the proposed method improves recognition performance over current state-of-the-art methods by 3.8% and 2.3% on the two datasets, respectively, demonstrating its effectiveness.
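The abstract describes a general recipe: per-frame visual features from the CLIP image encoder are aggregated by a temporal encoder into a video-level feature, class names are expanded into descriptive sentences encoded by the CLIP text encoder, and classification is performed by cosine similarity between the two. The sketch below illustrates that recipe under stated assumptions; it is not the authors' implementation, and the temporal encoder, the descriptive class sentences, and the omitted knowledge-database retrieval are hypothetical placeholders.

```python
# Minimal sketch of CLIP-based zero-shot action recognition with a temporal encoder.
# Assumes the OpenAI "clip" package (https://github.com/openai/CLIP) and PyTorch.
# Illustrative reconstruction from the abstract, not the authors' code: the temporal
# encoder, the descriptive class sentences, and the external knowledge database
# (omitted here) are placeholders.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# Descriptive sentences standing in for plain class labels (assumption: the paper
# replaces short labels such as "ride bike" with richer action descriptions).
class_descriptions = [
    "a person riding a bicycle along a road, pedaling with both legs",
    "a person climbing a rock wall, gripping holds with both hands",
]

class TemporalEncoder(nn.Module):
    """Aggregates per-frame CLIP features into one video-level feature.

    A single Transformer encoder layer with mean pooling is used here only as a
    stand-in for the temporal encoder described in the abstract; in the paper's
    setting it would be trained on seen classes under the ZSAR protocol before
    zero-shot evaluation on unseen classes.
    """
    def __init__(self, dim=512, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, frame_feats):                    # (batch, num_frames, dim)
        return self.encoder(frame_feats).mean(dim=1)   # (batch, dim)

temporal_encoder = TemporalEncoder().to(device)

@torch.no_grad()
def classify_video(frames):
    """frames: list of PIL.Image objects sampled from one video."""
    # Per-frame visual features from the frozen CLIP image encoder.
    imgs = torch.stack([preprocess(f) for f in frames]).to(device)
    frame_feats = clip_model.encode_image(imgs).float().unsqueeze(0)   # (1, T, 512)

    # Temporal aggregation into a single video-level feature.
    video_feat = temporal_encoder(frame_feats)                         # (1, 512)
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)

    # Text features for the descriptive class sentences.
    tokens = clip.tokenize(class_descriptions).to(device)
    text_feats = clip_model.encode_text(tokens).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    # Zero-shot prediction: cosine similarity between video and text features.
    logits = video_feat @ text_feats.t()                               # (1, num_classes)
    return logits.softmax(dim=-1)
```

In the paper's full pipeline, the external knowledge database would additionally supply retrieved auxiliary text information at inference to strengthen the visual-text association; that retrieval step and the fine-tuning stage are omitted from this sketch for brevity.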
Classification: TP391 (Automation and Computer Technology / Computer Application Technology)