Human Pose Estimation Based on Dual-Stream Fusion of CNN and Transformer

Authors: LI Xin[1]; ZHANG Dan; GUO Xin; WANG Song[1]; CHEN Enqing[1]

Affiliations: [1] School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China; [2] College of Landscape and Horticulture, Henan Forestry Vocational College, Luoyang, Henan 471002, China

Source: Computer Engineering and Applications (《计算机工程与应用》), 2025, Issue 5, pp. 187-199 (13 pages)

Funding: National Natural Science Foundation of China (62101503, 62301497); Henan Provincial Science and Technology Research Project (222102210102).

Abstract: Convolutional neural networks (CNNs) and Transformer models are widely used in human pose estimation. However, Transformers focus on capturing the global features of an image and overlook the local features that matter for the fine details of a human pose, while CNNs lack the global modeling capability of Transformers. To exploit the strength of CNNs in processing local information and of Transformers in processing global information, this paper builds a parallel CNN-Transformer dual-stream network architecture that aggregates rich feature information. Because a conventional Transformer requires the input image to be flattened into multiple patches, which hinders the extraction of position-sensitive human structural information, the multi-head attention structure is modified so that the model input retains the structure of the original 2D feature map. In addition, a feature coupling module is proposed to fuse the features of the two branches at different resolutions, preserving as much local and global feature information as possible. Finally, an improved coordinate attention module is introduced to further strengthen the network's feature extraction capability. Experimental results on the COCO and MPII datasets show that the proposed model achieves higher detection accuracy than current mainstream models, which indicates that it can effectively capture and fuse the local and global features of human poses.

Keywords: convolutional neural network; Transformer; local features; global features; 2D feature map; feature coupling

CLC number: TP391 [Automation and Computer Technology: Computer Application Technology]
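
The abstract mentions an improved coordinate attention module but does not describe the improvement. As a point of reference only, the following is a minimal NumPy sketch of the baseline coordinate attention idea: pool the feature map along each spatial axis separately, so positional information along height and width is kept, then gate the input with the two resulting attention maps. The function name, weight shapes, and the use of ReLU in place of the usual nonlinearity are assumptions made for illustration; batch normalization and biases are omitted, and this is not the paper's variant.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def coordinate_attention(x, w_reduce, w_h, w_w):
    """Baseline coordinate attention on one feature map (hypothetical helper).

    x:        (C, H, W) feature map
    w_reduce: (C_mid, C) weights of the shared 1x1 convolution
    w_h, w_w: (C, C_mid) weights of the per-direction 1x1 convolutions
    """
    c, h, w = x.shape
    pooled_h = x.mean(axis=2)  # (C, H): average over the width axis
    pooled_w = x.mean(axis=1)  # (C, W): average over the height axis

    # Concatenate both directional descriptors and mix channels jointly.
    y = np.concatenate([pooled_h, pooled_w], axis=1)  # (C, H + W)
    y = np.maximum(w_reduce @ y, 0.0)  # ReLU used here as a simplification

    # Split back into the two directions and build gates in (0, 1).
    a_h = sigmoid(w_h @ y[:, :h])  # (C, H)
    a_w = sigmoid(w_w @ y[:, h:])  # (C, W)

    # Broadcast the gates over the original 2D layout.
    return x * a_h[:, :, None] * a_w[:, None, :]


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 12))
w_reduce = rng.standard_normal((4, 8)) * 0.1
w_h = rng.standard_normal((8, 4)) * 0.1
w_w = rng.standard_normal((8, 4)) * 0.1
out = coordinate_attention(x, w_reduce, w_h, w_w)
print(out.shape)  # (8, 16, 12)
```

The output has the same shape as the input, so a module of this kind can be placed after a fusion stage without changing the 2D feature-map layout that the abstract emphasizes.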

 
