Monocular Image Depth Estimation Based on the Fusion of Transformer and CNN  (Cited by: 5)


Authors: ZHANG Tao, ZHANG Xiao-li[1], REN Yan[1] (School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014000, China)

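Affiliation: [1] School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014000, Inner Mongolia, China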

Source: Journal of Harbin University of Science and Technology, 2022, No. 6, pp. 88-94 (7 pages)

Funding: Inner Mongolia Autonomous Region Science and Technology Plan Project (2020GG0048).

Abstract: To address the low accuracy of depth estimation from monocular images, a monocular image depth estimation method that fuses a Transformer with a convolutional neural network (CNN) is proposed. First, ResNet-50 serves as the backbone of an encoder-decoder network to extract image features, and a hierarchical fusion scheme fuses the features from each encoder level and feeds them to the decoder, which improves the network's use of multi-scale feature information. Second, a Transformer network performs global analysis of the decoder's output features: its multi-head attention mechanism estimates depth information from the deep features output by the decoder, which improves the network's ability to extract multi-scale features and, in turn, the accuracy of the depth map. The model is validated on the NYU Depth-v2 dataset. Experimental results show that, compared with a multi-scale convolutional neural network, the method improves accuracy at δ<1.25 by 24.3% and reduces root mean square error by 61.3% (the English abstract published with the paper gives 19.9% and 49.9%, respectively). These results demonstrate the feasibility of the method for monocular image depth estimation.
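The abstract reports two evaluation metrics, accuracy at δ<1.25 and root mean square error (RMSE), without defining them. The sketch below assumes the definitions commonly used for NYU Depth-v2 benchmarks: δ<1.25 is the fraction of valid pixels for which max(pred/gt, gt/pred) is below 1.25, and RMSE is the square root of the mean squared depth error. The function names `threshold_accuracy` and `rmse`, the rule for which pixels count as valid, and the sample values are illustrative assumptions, not taken from the paper.

```python
import math


def threshold_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels whose depth ratio max(p/g, g/p) is below threshold.

    Assumes pred and gt are flat sequences of depths in the same unit.
    Pixels with non-positive depth are skipped (an assumption for this sketch).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


def rmse(pred, gt):
    """Root mean square error over pixels with positive ground-truth depth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


# Hypothetical predicted and ground-truth depths for four pixels.
pred = [1.0, 2.0, 3.0, 4.0]
gt = [1.0, 2.2, 2.0, 4.0]

print(threshold_accuracy(pred, gt))  # three of the four ratios are below 1.25
print(rmse(pred, gt))
```

Higher is better for δ<1.25 and lower is better for RMSE, which matches the direction of the improvements the abstract reports.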

Keywords: convolutional neural network; encoder-decoder; Transformer; depth estimation; monocular vision

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
