Text-Image Cross-modal Retrieval Based on Transformer  Cited by: 5

Authors: YANG Xiaoyu[1]; LI Chao[1]; CHEN Shunyao; LI Haoliang; YIN Guangqiang

Affiliation: [1] Center for Public Security Technology, University of Electronic Science and Technology of China, Chengdu 611731, China

Source: Computer Science (《计算机科学》), 2023, No. 4, pp. 141-148 (8 pages)

Funding: Shenzhen Science and Technology Program (JSGG20220301090405009).

Abstract: With the continued growth of Internet multimedia data, text-image retrieval has become a research hotspot. Image-text retrieval methods commonly use a mutual attention mechanism, in which image and text features interact, to obtain good matching results. However, such methods cannot produce image features and text features independently: the features must interact at the late stage of large-scale retrieval, which is time-consuming and rules out fast retrieval and matching. Meanwhile, Transformer-based cross-modal image-text feature learning has achieved good results and attracted increasing attention. This paper designs a novel Transformer-based text-image retrieval network, HAS-Net, with three main improvements: 1) a hierarchical Transformer encoding structure that makes better use of low-level syntactic information and high-level semantic information; 2) a new feature aggregation method, built on the self-attention mechanism, that improves on traditional global feature aggregation; 3) shared Transformer encoding layers that map image features and text features into a common feature encoding space. Experiments on the MS-COCO and Flickr30k datasets show improved cross-modal retrieval performance on both, placing the method in a leading position among similar algorithms and demonstrating the effectiveness of the designed network structure.
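
The abstract names two ideas that a small example can make concrete: aggregating a set of features with attention weights instead of a plain average, and ranking candidates by similarity between independently computed embeddings. The sketch below is illustrative only and is not the HAS-Net implementation. The function names, the single `query` vector standing in for learned attention parameters, and the use of cosine similarity are all assumptions made for this example; the record does not specify any of them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Aggregate feature vectors into one vector with attention weights.

    `query` is a hypothetical learned vector (an assumption of this sketch).
    Each feature is scored by its scaled dot product with `query`; the
    softmax of the scores weights the sum of the features.
    """
    dim = len(query)
    scores = [
        sum(f * q for f, q in zip(feat, query)) / math.sqrt(dim)
        for feat in features
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return pooled, weights


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_vec, image_vecs):
    """Return image indices sorted by descending similarity to text_vec.

    The image embeddings can be computed once, offline, because no
    text-image interaction is needed to produce them.
    """
    return sorted(
        range(len(image_vecs)),
        key=lambda i: cosine(text_vec, image_vecs[i]),
        reverse=True,
    )


if __name__ == "__main__":
    token_features = [[1.0, 0.0], [0.0, 1.0]]
    pooled, weights = attention_pool(token_features, query=[2.0, 0.0])
    print("weights:", weights)
    print("pooled:", pooled)
    print("ranking:", rank_images([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]))
```

With an all-zero `query` the weights are uniform and the result reduces to mean pooling, which is one way to see attention-based aggregation as a generalization of global averaging. Because each modality is embedded on its own, query-time cost is one similarity computation per candidate rather than a full cross-modal interaction.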

Keywords: Transformer; cross-modal retrieval; hierarchical feature extraction; feature aggregation; feature sharing

Classification: TP399 [Automation and Computer Technology: Computer Application Technology]

 
