Image Retrieval Based on Vision Transformer and Masked Learning  (Cited by: 5)

Authors: 李锋 潘煌圣 盛守祥 王国栋 LI Feng; PAN Huangsheng; SHENG Shouxiang; WANG Guodong (College of Computer Science and Technology, Donghua University, Shanghai 201620, China; Huafang Co., Ltd., Binzhou 256617, China)

Affiliations: [1] College of Computer Science and Technology, Donghua University, Shanghai 201620, China; [2] Huafang Co., Ltd., Binzhou 256617, China

Source: Journal of Donghua University (English Edition), 2023, No. 5, pp. 539-547 (9 pages)

Funding: the Project of Introducing Urgently Needed Talents in Key Supporting Regions of Shandong Province, China (No. SDJQP20221805).

Abstract: Deep convolutional neural networks (DCNNs) are widely used in content-based image retrieval (CBIR) because of their advantages in image feature extraction. However, training deep neural networks requires a large amount of labeled data, which limits their application. Self-supervised learning is a more general approach in unlabeled scenarios. A method of fine-tuning feature extraction networks based on masked learning is proposed. Masked autoencoders (MAE) are used to fine-tune the vision transformer (ViT) model. In addition, the scheme for extracting image descriptors is discussed. The encoder of the MAE uses the ViT to extract global features and performs self-supervised fine-tuning by reconstructing the pixels of masked areas. The method works well on category-level image retrieval datasets and shows marked improvements on instance-level datasets. For the instance-level datasets Oxford5k and Paris6k, the retrieval accuracy of the base model is improved by 7% and 17% compared to that of the original model, respectively.
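The masked-learning step the abstract describes hinges on MAE-style random masking: only a small random subset of patch tokens is passed to the ViT encoder, and the decoder must reconstruct the pixels of the masked patches. A minimal sketch of the masking itself (an illustration, not the authors' code), assuming the standard MAE mask ratio of 75% and a 224×224 image split into 16×16 patches:

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """MAE-style random masking over patch-token indices.

    Returns the sorted indices of visible patches (encoder input) and a
    binary mask over all patches (1 = masked / to reconstruct, 0 = visible).
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    indices = list(range(num_patches))
    rng.shuffle(indices)                      # uniform random selection
    keep_idx = sorted(indices[:num_keep])     # visible patches only
    mask = [1] * num_patches
    for i in keep_idx:
        mask[i] = 0
    return keep_idx, mask

# A 224x224 image with 16x16 patches yields 14*14 = 196 patch tokens.
keep_idx, mask = random_masking(196)
print(len(keep_idx))   # 49 visible tokens (25% of 196)
print(sum(mask))       # 147 masked tokens to reconstruct
```

Because the encoder only ever sees the visible 25% of tokens, fine-tuning in this way is far cheaper per step than full-image training, while the reconstruction loss on the masked 75% supplies the self-supervised signal without any labels.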

Keywords: content-based image retrieval; vision transformer; masked autoencoder; feature extraction

Classification codes (CLC): TP391.41 [Automation and Computer Technology: Computer Application Technology]; TP18 [Automation and Computer Technology: Computer Science and Technology]

 
