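Authors: YAN Gang (阎刚); WANG Haotian (王浩天) (School of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China)
Affiliation: [1] School of Artificial Intelligence and Data Science, Hebei University of Technology, Tianjin 300401
Source: Journal of Hebei University of Technology (《河北工业大学学报》), 2025, No. 2, pp. 32-41 (10 pages)
Funding: National Natural Science Foundation of China (62102129).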
Abstract: With video resources becoming increasingly abundant, research on cross-modal video moment retrieval has gradually emerged. Because video and text come from different feature spaces, learning a common feature space that bridges the semantic gap between the two modalities has become the key problem. Existing methods use cross-modal encoders to align features from different modalities, but multiple clips in the same video interfere with one another, which makes the video representation too coarse. In addition, the cross-modal encoder is computationally expensive, which leads to long retrieval times. To address these two problems, a two-stage video moment retrieval network based on multiple contrastive learning (MCLNet) is proposed. The model applies video-level contrastive learning, clip-level contrastive learning, and intra-video-modality contrastive learning to improve feature alignment and reduce interference, which addresses the overly coarse video representation. The model also splits video retrieval and moment localization into two stages, so that videos can be pre-encoded and stored in the first stage, which addresses the long retrieval time. Experimental results on two video moment retrieval datasets, TVR and DiDeMo, demonstrate the effectiveness of MCLNet.
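Keywords: cross-modal video moment retrieval; common feature space; feature alignment; contrastive learning; video representation
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]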
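The abstract names video-level contrastive learning as one of the alignment objectives but gives no formula. As a rough illustration only, the sketch below implements a generic symmetric InfoNCE loss over a batch of paired video and text embeddings. It is not the MCLNet loss: the function name `info_nce`, the temperature value, and the use of cosine similarity are all assumptions made for this example.

```python
import numpy as np


def info_nce(video_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's exact loss).

    video_emb, text_emb: arrays of shape (batch, dim); row i of each array is
    assumed to be a matching (video, text) pair. temperature is an assumed
    hyperparameter, not a value taken from the paper.
    """
    # L2-normalize so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # shape (batch, batch)

    def diag_cross_entropy(scores):
        # Numerically stable log-softmax over each row; the matching pair
        # for row i sits on the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    videos = rng.normal(size=(4, 8))
    texts = videos + 0.1 * rng.normal(size=(4, 8))  # roughly aligned pairs
    print("aligned loss:", info_nce(videos, texts))
    print("shuffled loss:", info_nce(videos, np.roll(texts, 1, axis=0)))
```

Minimizing a loss of this kind pulls matching video and text embeddings together in the shared space and pushes mismatched pairs apart, which is the general idea behind the feature alignment the abstract describes.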