Authors: Mingkang WANG, Min MENG, Jigang LIU, Jigang WU
Affiliations: [1] School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China; [2] Ping An Life Insurance of China, Shenzhen 518033, China
Source: Virtual Reality & Intelligent Hardware, 2023, No. 6, pp. 509-522 (14 pages). Chinese title: 虚拟现实与智能硬件(中英文)
Funding: Supported by the National Natural Science Foundation of China (62172109, 62072118); the National Science Foundation of Guangdong Province (2022A1515010322); the Guangdong Basic and Applied Basic Research Foundation (2021B1515120010); and the Huangpu International Sci & Tech Cooperation Foundation of Guangzhou (2021GH12).
Abstract: Background: Cross-modal retrieval has attracted widespread attention in many cross-media similarity search applications, particularly image-text retrieval in the fields of computer vision and natural language processing. Recently, visual and semantic embedding (VSE) learning has shown promising improvements in image-text retrieval tasks. Most existing VSE models employ two unrelated encoders to extract features and then use complex methods to contextualize and aggregate these features into holistic embeddings. Despite recent advances, existing approaches still suffer from two limitations: (1) without considering intermediate interactions and adequate alignment between different modalities, these models cannot guarantee the discriminative ability of representations; and (2) existing feature aggregators are susceptible to certain noisy regions, which may lead to unreasonable pooling coefficients and affect the quality of the final aggregated features. Methods: To address these challenges, we propose a novel cross-modal retrieval model containing a well-designed alignment module and a novel multimodal fusion encoder that aims to learn the adequate alignment and interaction of aggregated features to effectively bridge the modality gap. Results: Experiments on the Microsoft COCO and Flickr30k datasets demonstrated the superiority of our model over state-of-the-art methods.
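As a rough sketch of the visual-semantic embedding setup the abstract outlines, the code below projects pre-extracted image-region and text-token features into a shared space, uses a single Transformer layer as a stand-in for the paper's alignment and fusion stages (whose details are not given here), and scores image-text pairs by cosine similarity with a hinge-based triplet ranking loss common in VSE work. All module names, dimensions, and the pooling and loss choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VSEModel(nn.Module):
    """Minimal visual-semantic embedding sketch (illustrative only).

    The image branch projects pre-extracted region features and the text branch
    projects token features; a shared-architecture Transformer layer stands in
    for the paper's contextualization/fusion stage. Dimensions are assumptions.
    """

    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # region features -> joint space
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # token features  -> joint space
        # One Transformer encoder layer as a simple contextualizer/aggregator.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True
        )

    def embed(self, feats, proj):
        x = self.encoder(proj(feats))   # contextualize projected features
        x = x.mean(dim=1)               # mean-pool into a single holistic vector
        return F.normalize(x, dim=-1)   # unit norm so dot product = cosine similarity

    def forward(self, img_feats, txt_feats):
        v = self.embed(img_feats, self.img_proj)   # (B, D) image embeddings
        t = self.embed(txt_feats, self.txt_proj)   # (B, D) text embeddings
        return v @ t.t()                           # (B, B) similarity matrix


def triplet_ranking_loss(sim, margin=0.2):
    """Hinge-based triplet ranking loss with hardest in-batch negatives."""
    pos = sim.diag().view(-1, 1)                                    # matched-pair scores
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)      # image anchor
    cost_v = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)  # text anchor
    return cost_t.max(dim=1)[0].mean() + cost_v.max(dim=0)[0].mean()
```

With a batch of region features of shape (B, R, 2048) and token features of shape (B, T, 768), the model returns a (B, B) similarity matrix whose diagonal holds the matched image-text pairs; the loss then pushes each matched pair above its hardest negative by the margin.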
Keywords: Cross-modal retrieval; Visual semantic embedding; Feature aggregation; Transformer
Classification: TP391.3 [Automation and Computer Technology - Computer Application Technology]