Fine-Grained Image Classification Model Based on Transformer and Enhanced Local Features


Authors: Ye Li [1], Jiaqi Cai (School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai)

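Affiliation: [1] School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai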

Source: Modeling and Simulation, 2024, Issue 4, pp. 4702-4714 (13 pages)

Funding: National Natural Science Foundation of China (61703277); Shanghai Sailing Program (17YF1427000).

Abstract: Vision Transformer (ViT) has been widely applied to fine-grained visual classification. To address its limited ability to capture local information, a new Transformer-based fine-grained image classification model with enhanced local features is proposed. First, an attention embedding module is introduced: it uses deformable convolution and an attention module to convert the original image, before it enters the model, into features that focus on important information. These features are then embedded into the model, strengthening the effective local features of the input. Second, an enhanced self-attention module is added to the original ViT model so that global and local dependencies are processed simultaneously; combining the self-attention mechanism with convolution operations handles local features better. Finally, cross-entropy loss is combined with contrastive loss to optimize for the subtle differences between subcategories, minimizing the similarity of classification tokens with different labels and increasing the similarity of those with the same label. The proposed algorithm achieves recognition accuracies of 91.8%, 90.1%, and 90.3% on the CUB-200-2011, Stanford Dogs, and NABirds fine-grained image datasets, respectively, surpassing several leading fine-grained image classification methods.

Keywords: fine-grained image classification; Vision Transformer; local features; deformable convolution; self-attention module

Classification number: TP3 [Automation and Computer Technology - Computer Science and Technology]
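
Illustrative note: the abstract's third component combines cross-entropy loss with a contrastive loss on the classification tokens, but this record gives no formula. The sketch below is therefore an assumption, not the authors' implementation. It uses one common formulation: cosine similarity between classification tokens, a penalty of `1 - sim` for same-label pairs, and a penalty of `max(sim - margin, 0)` for different-label pairs. The function names, the `margin` value of 0.4, and the `weight` parameter are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch (log-sum-exp for stability)."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[y]
    return total / len(labels)

def contrastive(tokens, labels, margin=0.4):
    """Assumed contrastive term over all ordered pairs in the batch.

    Same-label pairs are pulled together (penalty 1 - sim); different-label
    pairs are penalised only when their similarity exceeds the margin.
    """
    n = len(tokens)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = cosine(tokens[i], tokens[j])
            if labels[i] == labels[j]:
                total += 1.0 - sim
            else:
                total += max(sim - margin, 0.0)
    return total / (n * n)

def combined_loss(logits, tokens, labels, margin=0.4, weight=1.0):
    """Cross-entropy plus the weighted contrastive term (weight is hypothetical)."""
    return cross_entropy(logits, labels) + weight * contrastive(tokens, labels, margin)
```

In this sketch, a batch whose same-label tokens already point in the same direction and whose different-label tokens are orthogonal incurs no contrastive penalty, so only the cross-entropy term remains.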

 
