Hash Food Image Retrieval Based on Enhanced Vision Transformer

Authors: CAO Pindan; MIN Weiqing; SONG Jiajun; SHENG Guorui [1]; YANG Yancun [1]; WANG Lili [1]; JIANG Shuqiang [2]

Affiliations: [1] School of Information and Electrical Engineering, Ludong University, Yantai 264025, Shandong, China; [2] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; [3] School of Agricultural Economics and Rural Development, Renmin University of China, Beijing 100872, China

Source: Food Science (《食品科学》), 2024, Issue 10, pp. 1-8 (8 pages)

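Funding: National Natural Science Foundation of China, Young Scientists Fund (61705098); National Natural Science Foundation of China, General Program (61872170); Natural Science Foundation of Shandong Province (ZR2023MF031).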

Abstract: Food image retrieval, a major task in food computing, has garnered extensive attention in recent years. However, it faces two primary challenges. First, food images are fine-grained: visual differences between food categories may be subtle and are often observable only in local regions of the image. Second, food images contain abundant semantic information, such as ingredients and cooking methods, and extracting and using this information is crucial for improving retrieval performance. To address these issues, this paper proposes an enhanced ViT hash network (EVHNet) based on a pre-trained Vision Transformer (ViT) model. To handle the fine-grained nature of food images, EVHNet includes a local feature enhancement module built on a convolutional structure, which enables the network to learn more representative features. To better leverage the semantic information in food images, EVHNet also includes an aggregated semantic feature module, which aggregates the semantic information in a food image according to the class token features. The proposed EVHNet model was evaluated under three popular hash image retrieval frameworks, namely greedy hash (GreedyHash), central similarity quantization (CSQ), and deep polarized network (DPN), and compared with four mainstream network models: AlexNet, ResNet50, ViT-B_32, and ViT-B_16. Experimental results on three food datasets, Food-101, Vireo Food-172, and UEC Food-256, show that the overall retrieval accuracy of EVHNet is better than that of the other models.
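As a rough illustration of the hash retrieval setting that the abstract refers to (not the EVHNet model itself, and not any of the GreedyHash, CSQ, or DPN training procedures), the sketch below assumes that some network has already produced a real-valued feature vector per image. It binarizes those features by sign and ranks database images by Hamming distance to the query code. All function names and the toy feature values are hypothetical and chosen only for this example.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> List[int]:
    """Turn a real-valued feature vector into a binary code using its sign."""
    return [1 if value >= 0 else 0 for value in features]


def hamming_distance(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Count the bit positions in which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def retrieve(query_code: Sequence[int],
             database_codes: Sequence[Sequence[int]],
             top_k: int = 3) -> List[int]:
    """Return indices of the top_k database codes closest to the query.

    sorted() is stable, so ties keep their original database order.
    """
    ranked = sorted(
        range(len(database_codes)),
        key=lambda index: hamming_distance(query_code, database_codes[index]),
    )
    return ranked[:top_k]


# Toy features standing in for network outputs (assumed values, 4-bit codes).
database_features = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, 0.3, -0.1, 0.8],
    [0.7, -0.6, 0.2, -0.1],
]
query_features = [0.8, -0.1, 0.5, -0.9]

database_codes = [binarize(features) for features in database_features]
query_code = binarize(query_features)
result = retrieve(query_code, database_codes, top_k=2)
print(result)
```

In practice the code length is much larger than four bits and the codes are learned end to end, but the retrieval step reduces to the same comparison of binary codes, which is why hashing is attractive for large image collections.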

Keywords: food image retrieval; food computing; hash retrieval; Vision Transformer network; deep hash learning

Classification code: S126 [Agricultural Sciences: Basic Agricultural Sciences]
