Design of FPGA accelerator with high parallelism for convolution neural network  (Cited by: 7)


Authors: WANG Xiaofeng; JIANG Penglong [1,2]; ZHOU Hui [1,2]; ZHAO Xiongbo

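Affiliations: [1] Beijing Aerospace Automatic Control Institute, Beijing 100854, China; [2] National Key Laboratory of Science and Technology on Aerospace Intelligence Control, Beijing 100854, China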

Source: Journal of Computer Applications (《计算机应用》), 2021, No. 3, pp. 812-819 (8 pages)

Funding: Military scientific research funded project; Innovation R&D Project of the China Academy of Launch Vehicle Technology.

Abstract: Most algorithms based on Convolutional Neural Networks (CNNs) are computation-intensive and memory-intensive, which makes them difficult to apply in embedded fields with low-power requirements, such as aerospace, mobile robotics, and smartphones. To address this problem, a high-parallelism Field Programmable Gate Array (FPGA) accelerator for CNNs is proposed. First, the four kinds of parallelism in CNN algorithms that can be exploited for FPGA acceleration are compared. Then, a Multi-channel Convolutional Rotating-register Pipeline (MCRP) structure is proposed, which exploits the intra-kernel parallelism of the convolution concisely and effectively. Finally, by combining input/output channel parallelism with intra-kernel parallelism, a high-parallelism CNN accelerator architecture based on the MCRP structure is proposed; to verify the soundness of the design, it was deployed on the Xilinx XCZU9EG chip. With the on-chip Digital Signal Processor (DSP) resources fully utilized, the accelerator reaches a peak computing capacity of 2304 GOPS (Giga Operations Per Second). With the SSD-300 algorithm as the test workload, the accelerator achieves an actual computing capacity of 1830.33 GOPS, corresponding to a hardware utilization of 79.44%. The experimental results show that the MCRP structure effectively improves the computing capacity of the CNN accelerator, and that an MCRP-based CNN accelerator can largely meet the computing requirements of most embedded applications.

Keywords: Convolutional Neural Network (CNN); high performance; hardware accelerator; parallelism; Field Programmable Gate Array (FPGA)

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
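
Note: the figures quoted in the abstract can be checked with a few lines of arithmetic, and the kinds of parallelism it names map onto the loops of a direct convolution. The following is a minimal Python sketch under stated assumptions, not the authors' MCRP design: the function names `utilization` and `conv2d_direct` are hypothetical, the convolution uses stride 1 with no padding, and the loop labels mark only the parallelism kinds the abstract names (the abstract does not enumerate all four).

```python
def utilization(actual_gops, peak_gops):
    """Hardware utilization as the ratio of achieved to peak throughput."""
    return actual_gops / peak_gops


def conv2d_direct(x, w):
    """Direct convolution (stride 1, no padding), written as a plain loop nest.

    x: input feature maps, indexed x[in_channel][row][col]
    w: kernels, indexed w[out_channel][in_channel][k_row][k_col]
    An accelerator would unroll some of these loops in hardware; here they
    simply run sequentially so the structure is visible.
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    c_out, k = len(w), len(w[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    y = [[[0] * out_w for _ in range(out_h)] for _ in range(c_out)]
    for oc in range(c_out):              # output-channel parallelism
        for ic in range(c_in):           # input-channel parallelism
            for r in range(out_h):       # output positions (rows)
                for c in range(out_w):   # output positions (columns)
                    for i in range(k):       # intra-kernel parallelism
                        for j in range(k):   # intra-kernel parallelism
                            y[oc][r][c] += x[ic][r + i][c + j] * w[oc][ic][i][j]
    return y


# Figures reported in the abstract: 1830.33 GOPS achieved, 2304 GOPS peak.
reported = round(100 * utilization(1830.33, 2304), 2)
```

Here `reported` reproduces the 79.44% utilization stated in the abstract. How the MCRP structure actually schedules the intra-kernel loops is described in the full paper, not in this record.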

 
