Authors: Hou Yonghong; Zheng Haochun; Gao Jiajun; Ren Yi
Affiliations: [1] School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China; [2] School of Future Technology, Tianjin University, Tianjin 300072, China; [3] Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Source: Journal of Tianjin University (Science and Technology), 2025, No. 1, pp. 91-100 (10 pages)
Funding: National Natural Science Foundation of China (62102422).
Abstract: Zero-shot action recognition (ZSAR) aims to learn knowledge from seen action classes and transfer it to unseen action classes, thereby achieving recognition and classification of unknown action samples. Existing ZSAR models, however, rely on limited training data, which restricts the prior knowledge they can learn and prevents them from mapping visual features accurately onto semantic labels; this is a key factor limiting zero-shot learning performance. To address this issue, this study proposes a ZSAR framework that introduces an external knowledge database and the contrastive language-image pretraining (CLIP) model. The framework exploits the knowledge accumulated by the multimodal CLIP model through self-supervised contrastive learning to expand the prior knowledge available to the ZSAR model, and a temporal encoder is designed to compensate for CLIP's lack of temporal modeling capability. To let the model learn richer semantic features and narrow the semantic gap between visual features and semantic labels, the semantic labels of seen action classes are extended: simple text labels are replaced with more detailed descriptive sentences, enriching the semantic information of the text representations. On this basis, a knowledge database is constructed outside the model, providing additional auxiliary information without increasing the number of model parameters and strengthening the association between visual and text feature representations. Finally, following the standard zero-shot learning protocol, the model is fine-tuned for the ZSAR task to improve its generalization ability. Extensive experiments on two mainstream datasets, HMDB51 and UCF101, show that the proposed method improves recognition performance over current state-of-the-art methods by 3.8% and 2.3% on the two datasets, respectively, demonstrating its effectiveness.
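The abstract describes a general recipe: per-frame visual features from the CLIP image encoder are aggregated by a temporal encoder into a video-level feature, class names are expanded into descriptive sentences encoded by the CLIP text encoder, and classification is performed by cosine similarity between the two. The sketch below illustrates that recipe under stated assumptions; it is not the authors' implementation, and the temporal encoder, the descriptive class sentences, and the omitted knowledge-database retrieval are hypothetical placeholders.

```python
# Minimal sketch of CLIP-based zero-shot action recognition with a temporal encoder.
# Assumes the OpenAI "clip" package (https://github.com/openai/CLIP) and PyTorch.
# Illustrative reconstruction from the abstract, not the authors' code: the temporal
# encoder, the descriptive class sentences, and the external knowledge database
# (omitted here) are placeholders.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# Descriptive sentences standing in for plain class labels (assumption: the paper
# replaces short labels such as "ride bike" with richer action descriptions).
class_descriptions = [
    "a person riding a bicycle along a road, pedaling with both legs",
    "a person climbing a rock wall, gripping holds with both hands",
]

class TemporalEncoder(nn.Module):
    """Aggregates per-frame CLIP features into one video-level feature.

    A single Transformer encoder layer with mean pooling is used here only as a
    stand-in for the temporal encoder described in the abstract; in the paper's
    setting it would be trained on seen classes under the ZSAR protocol before
    zero-shot evaluation on unseen classes.
    """
    def __init__(self, dim=512, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, frame_feats):                    # (batch, num_frames, dim)
        return self.encoder(frame_feats).mean(dim=1)   # (batch, dim)

temporal_encoder = TemporalEncoder().to(device)

@torch.no_grad()
def classify_video(frames):
    """frames: list of PIL.Image objects sampled from one video."""
    # Per-frame visual features from the frozen CLIP image encoder.
    imgs = torch.stack([preprocess(f) for f in frames]).to(device)
    frame_feats = clip_model.encode_image(imgs).float().unsqueeze(0)   # (1, T, 512)

    # Temporal aggregation into a single video-level feature.
    video_feat = temporal_encoder(frame_feats)                         # (1, 512)
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)

    # Text features for the descriptive class sentences.
    tokens = clip.tokenize(class_descriptions).to(device)
    text_feats = clip_model.encode_text(tokens).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    # Zero-shot prediction: cosine similarity between video and text features.
    logits = video_feat @ text_feats.t()                               # (1, num_classes)
    return logits.softmax(dim=-1)
```

In the paper's full pipeline, the external knowledge database would additionally supply retrieved auxiliary text information at inference to strengthen the visual-text association; that retrieval step and the fine-tuning stage are omitted from this sketch for brevity.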
Classification: TP391 (Automation and Computer Technology / Computer Application Technology)