FPGA-based floating-point separable convolutional neural network acceleration method  Cited by: 2


Authors: ZHANG Zhichao; WANG Jian [1,2,3]; ZHANG Longbing; XIAO Junhua [1,2,3]

Affiliations: [1] State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190; [2] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; [3] University of Chinese Academy of Sciences, Beijing 100049; [4] The 15th Research Institute of China Electronics Technology Group Corporation, Beijing 100083

Source: Chinese High Technology Letters, 2022, No. 5, pp. 441-453 (13 pages)

Funding: Supported by the National Natural Science Foundation of China (61432016) and the National Key R&D Program of China (2018YFC0832306, 2018YFC0831203, 2018YFC0831206).

Abstract: To address the speed bottleneck and power constraints that separable convolutional neural networks face in spaceborne aircraft target type classification, this paper proposes a floating-point depthwise separable convolutional neural network acceleration method based on field programmable gate array (FPGA) data stream scheduling, and uses it to accelerate the general MobileNet image classification model. A depthwise separable convolution computation array built from a multiplication matrix and a forward addition tree removes the line-rate throughput bottleneck in floating-point acceleration of depthwise separable convolution. Experimental results show that FPGA-based target classification runs at 633 FPS with a power consumption of 22.226 W and a computing performance of 236.04 GFLOPS. The computational speed is 1.10 to 2.61 times that of a Titan Xp GPU, and the computational efficiency is 7.44 to 18.66 times that of the Titan Xp GPU. Among comparable FPGA-based floating-point convolution acceleration schemes, the method achieves the best computing performance and energy efficiency. The method also delivers image classification accuracy consistent with the original model, decouples the software/hardware co-development flow, and lowers the barrier for application developers to use FPGA-accelerated computing.

Keywords: depthwise separable convolution; field programmable gate array (FPGA); data stream scheduling; acceleration; image classification

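Classification codes: TP751 [Automation and Computer Technology: Detection Technology and Automatic Equipment]; TN791 [Automation and Computer Technology: Control Science and Engineering]; TP183 [Electronics and Telecommunications: Circuits and Systems]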
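
The abstract names a multiplication matrix followed by a forward addition tree as the core of the depthwise separable convolution array. The NumPy sketch below is only a software illustration of that dataflow, not the authors' hardware design. The function names, the tensor layouts, stride 1, and the absence of padding are all assumptions made for the example.

```python
import numpy as np


def adder_tree(products):
    """Sum along the last axis by pairwise additions, as an adder tree would."""
    vals = np.asarray(products, dtype=np.float32)
    while vals.shape[-1] > 1:
        if vals.shape[-1] % 2:
            # Pad odd widths with a zero so every value has a partner.
            pad = np.zeros(vals.shape[:-1] + (1,), dtype=np.float32)
            vals = np.concatenate([vals, pad], axis=-1)
        vals = vals[..., 0::2] + vals[..., 1::2]
    return vals[..., 0]


def depthwise_conv(x, w):
    """Depthwise stage. x: (C, H, W), w: (C, K, K); stride 1, no padding (assumed)."""
    c, h, wd = x.shape
    k = w.shape[-1]
    out = np.empty((c, h - k + 1, wd - k + 1), dtype=np.float32)
    for i in range(h - k + 1):
        for j in range(wd - k + 1):
            window = x[:, i:i + k, j:j + k]
            # Multiplier array: K*K products per channel, no cross-channel mixing.
            products = (window * w).reshape(c, k * k)
            out[:, i, j] = adder_tree(products)
    return out


def pointwise_conv(x, w):
    """Pointwise (1x1) stage. x: (C, H, W), w: (M, C) -> (M, H, W)."""
    # Products have shape (M, H, W, C); the tree reduces over the C input channels.
    products = w[:, None, None, :] * np.moveaxis(x, 0, -1)[None, ...]
    return adder_tree(products)


def depthwise_separable_conv(x, dw, pw):
    """Depthwise convolution followed by pointwise convolution."""
    return pointwise_conv(depthwise_conv(x, dw), pw)
```

Calling `depthwise_separable_conv(x, dw, pw)` with `x` of shape `(4, 6, 6)`, `dw` of shape `(4, 3, 3)`, and `pw` of shape `(8, 4)` yields an `(8, 4, 4)` output. Separating the multiply step from the pairwise reduction mirrors how a pipelined design could issue all products in one stage and sum them in roughly log2(N) adder levels.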

 
