Adequate alignment and interaction for cross-modal retrieval  


Authors: Mingkang WANG, Min MENG, Jigang LIU, Jigang WU

Affiliations: [1] School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China; [2] Ping An Life Insurance of China, Shenzhen 518033, China

Source: Virtual Reality & Intelligent Hardware, 2023, No. 6, pp. 509-522 (14 pages)

Funding: Supported by the National Natural Science Foundation of China (62172109, 62072118); the National Science Foundation of Guangdong Province (2022A1515010322); the Guangdong Basic and Applied Basic Research Foundation (2021B1515120010); and the Huangpu International Sci&Tech Cooperation Foundation of Guangzhou (2021GH12).

Abstract: Background Cross-modal retrieval has attracted widespread attention in many cross-media similarity search applications, particularly image-text retrieval in the fields of computer vision and natural language processing. Recently, visual and semantic embedding (VSE) learning has shown promising improvements in image-text retrieval tasks. Most existing VSE models employ two unrelated encoders to extract features and then use complex methods to contextualize and aggregate these features into holistic embeddings. Despite recent advances, existing approaches still suffer from two limitations: (1) without considering intermediate interactions and adequate alignment between different modalities, these models cannot guarantee the discriminative ability of the representations; and (2) existing feature aggregators are susceptible to certain noisy regions, which may lead to unreasonable pooling coefficients and degrade the quality of the final aggregated features. Methods To address these challenges, we propose a novel cross-modal retrieval model containing a well-designed alignment module and a novel multimodal fusion encoder, which aims to learn adequate alignment and interaction of aggregated features to effectively bridge the modality gap. Results Experiments on the Microsoft COCO and Flickr30k datasets demonstrated the superiority of our model over state-of-the-art methods.
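For context, the generic VSE pipeline that the abstract critiques can be sketched as follows: two independent encoders project local features (image regions, word tokens) into a joint space, a simple aggregator pools them into holistic embeddings, and retrieval ranks candidates by cosine similarity. This is a minimal illustrative sketch, not the authors' model; all names, dimensions, and the mean-pooling aggregator (one of the noise-sensitive schemes the paper argues against) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project embeddings onto the unit sphere so that dot products
    # equal cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def encode(features, projection):
    # Stand-in for a modality encoder: map local features into the
    # joint space, then mean-pool them into one holistic embedding.
    projected = features @ projection        # (n_local, dim_joint)
    pooled = projected.mean(axis=0)          # (dim_joint,)
    return l2_normalize(pooled)

# Toy data: 3 images with 5 region features each (16-d), 3 captions
# with 7 token features each (8-d), mapped into a 4-d joint space.
img_proj = rng.normal(size=(16, 4))
txt_proj = rng.normal(size=(8, 4))
images = [rng.normal(size=(5, 16)) for _ in range(3)]
texts = [rng.normal(size=(7, 8)) for _ in range(3)]

img_emb = np.stack([encode(v, img_proj) for v in images])  # (3, 4)
txt_emb = np.stack([encode(t, txt_proj) for t in texts])   # (3, 4)

# Image-to-text retrieval: rank captions by cosine similarity.
similarity = img_emb @ txt_emb.T                           # (3, 3)
ranking = np.argsort(-similarity, axis=1)
print(ranking[0])  # caption indices for image 0, best match first
```

Because the two encoders here never interact before pooling, the pooled embeddings can be dominated by noisy local features — the gap that the paper's alignment module and multimodal fusion encoder are designed to close.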

Keywords: Cross-modal retrieval; Visual semantic embedding; Feature aggregation; Transformer

CLC number: TP391.3 [Automation and Computer Technology: Computer Application Technology]
