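Affiliations: [1] School of Artificial Intelligence, Hubei University, Wuhan 430062; [2] School of Computer Science and Information Engineering, Hubei University, Wuhan 430062
Source: Journal of Computer Applications (《计算机应用》), 2024, No. 10, pp. 3209-3216 (8 pages)
Funding: Ministry of Education Industry-University Cooperative Education Project (202101142041).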
Abstract: Underwater fish image classification is a highly challenging task. The traditional Vision Transformer (ViT) backbone has difficulty processing local continuous features and performs poorly on fish classification when image quality is low. To solve this problem, a Transformer image classification network based on position-encoded Overlapping Patch Embedding (OPE) and Multi-scale Channel Interactive Attention (MCIA), called PIFormer (Positional overlapping and Interactive attention transFormer), was proposed. PIFormer was built in a multi-stage form, with each stage stacked a different number of times to facilitate the extraction of features at different depths. Firstly, the deep Positional Overlapping Patch Embedding (POPE) module was introduced to split the feature map and edge information into overlapping patches, so as to retain the local continuous features of the fish body; position information was also added to order the patches, helping PIFormer integrate detailed features and build a global map. Then, the MCIA module was proposed to process local and global features in parallel and to establish long-distance dependencies between different parts of the fish body. Finally, the high-level features were processed in groups by a Group Multi-Layer Perceptron (GMLP) to improve network efficiency and produce the final fish classification. To verify the effectiveness of PIFormer, a self-built dataset of freshwater fishes in East Lake was constructed, and the public datasets Fish4Knowledge and NCFM (Nature Conservancy Fisheries Monitoring) were used to ensure experimental fairness. Experimental results show that the Top-1 classification accuracy of the proposed network on the three datasets reaches 97.99%, 99.71% and 90.45% respectively. Compared with ViT, Swin Transformer and PVT (Pyramid Vision Transformer) of the same depth, the proposed network reduces the number of parameters by 72.62×10^6, 14.34×10^6 and 11.30×10^6 respectively, and reduces the number of floating-point operations (FLOPs) by 14.52×10^9, 2.02×10^9 and 1.48×10^9 respectively. PIFormer therefore achieves strong fish image classification performance at a lower computational cost.
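To illustrate why overlapping patch embedding helps retain local continuity, the following is a minimal sketch, not the paper's implementation. It compares the pixel windows covered by non-overlapping patches with those covered by overlapping ones along a single image axis. The image size (224) and the kernel, stride and padding values are assumptions chosen for illustration; the abstract does not state the settings used by POPE.

```python
def patch_windows(size, kernel, stride, padding=0):
    """Return the (start, end) pixel range of each patch along one axis.

    The patch count follows the usual sliding-window formula
    floor((size + 2 * padding - kernel) / stride) + 1.
    Ranges are clipped to the image, so padded pixels are not counted.
    """
    count = (size + 2 * padding - kernel) // stride + 1
    windows = []
    for i in range(count):
        start = i * stride - padding
        windows.append((max(start, 0), min(start + kernel, size)))
    return windows


def shared_pixels(windows):
    """Number of pixels each patch shares with its right-hand neighbor."""
    return [
        max(0, windows[i][1] - windows[i + 1][0])
        for i in range(len(windows) - 1)
    ]


# Hypothetical settings: ViT-style non-overlapping patches versus
# overlapping patches whose kernel is larger than the stride.
plain = patch_windows(224, kernel=4, stride=4)
overlap = patch_windows(224, kernel=7, stride=4, padding=3)

print(len(plain), set(shared_pixels(plain)))
print(len(overlap), set(shared_pixels(overlap)))
```

With these assumed settings both schemes produce the same number of patches per axis, but each overlapping patch shares pixels with its neighbor, so a feature that straddles a patch boundary (such as a fin edge) appears whole in at least one token rather than being cut in two.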
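Keywords: fish image classification; positional encoding; overlapping patch embedding; channel interactive attention; Vision Transformer
Classification number: TP391.4 [Automation and Computer Technology: Computer Application Technology]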