Design of an FPGA-based multi-core scalable convolution accelerator  Cited by: 1

Design of CNN accelerator with multi-core based on FPGA

Authors: ZHANG Kun-ning; ZHAO Shuo; SUN Qing-bin; DENG Ning[1]; HE Hu[1] (Institute of Microelectronics, Tsinghua University, Beijing 100084, China)

Affiliation: [1] Institute of Microelectronics, Tsinghua University, Beijing 100084, China

Source: Computer Engineering and Design (《计算机工程与设计》), 2021, No. 6, pp. 1592-1598 (7 pages)

Funding: National Natural Science Foundation of China (91846303).

Abstract: To address the low computational efficiency and energy efficiency of convolutional neural networks (CNNs), this paper proposes and designs an FPGA-based convolution accelerator that takes fixed-point data as input. The accelerator supports convolution on dynamically quantized 8-bit fixed-point data and improves computational efficiency through a tiling strategy and an improved loop order. It also supports activation, batch normalization (BN), pooling, and fully connected computation. Following a hardware/software co-design approach, an SoC containing the convolution accelerator and an ARM processor is designed. A method for multi-core expansion of the accelerator is proposed to raise computing performance and make the design easier to port to different FPGA platforms. The accelerator is deployed on the Xilinx ZCU102 development board, where the single-core accelerator reaches 153.6 GOP/s; with four and eight cores, performance rises to 614.4 GOP/s and 1024 GOP/s, respectively.
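The abstract names two techniques without giving details: dynamically quantized 8-bit fixed-point data, and tiled convolution with a reordered loop nest. The Python sketch below illustrates one common reading of each and should not be taken as the paper's actual design. The per-tensor choice of fractional bits, the tile sizes `Tm`, `Tc`, `Tr`, `Tq`, the loop order, and all function names are assumptions made for illustration only.

```python
import math


def choose_frac_bits(values, total_bits=8):
    """Assumed 'dynamic' scheme: pick the fractional length per tensor so the
    largest magnitude still fits in a signed total_bits word."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    int_bits = math.floor(math.log2(max_abs)) + 1  # bits for the integer part
    return total_bits - 1 - int_bits               # one bit is the sign


def quantize(values, frac_bits, total_bits=8):
    """Scale by 2**frac_bits, round, and saturate to the signed range."""
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    scale = 2.0 ** frac_bits
    return [max(lo, min(hi, round(v * scale))) for v in values]


def conv2d_direct(x, w):
    """Reference convolution. x: [C][H][W], w: [M][C][K][K], stride 1, no padding."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    M, K = len(w), len(w[0][0])
    R, Q = H - K + 1, W - K + 1
    out = [[[0] * Q for _ in range(R)] for _ in range(M)]
    for m in range(M):
        for r in range(R):
            for q in range(Q):
                acc = 0
                for c in range(C):
                    for i in range(K):
                        for j in range(K):
                            acc += x[c][r + i][q + j] * w[m][c][i][j]
                out[m][r][q] = acc
    return out


def conv2d_tiled(x, w, Tm=2, Tc=2, Tr=2, Tq=2):
    """Same result, computed tile by tile. The input-channel tile loop is the
    innermost tile loop, so each output tile is finished before moving on
    (partial sums could stay in on-chip buffers in a hardware design)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    M, K = len(w), len(w[0][0])
    R, Q = H - K + 1, W - K + 1
    out = [[[0] * Q for _ in range(R)] for _ in range(M)]
    for m0 in range(0, M, Tm):                  # output-channel tiles
        for r0 in range(0, R, Tr):              # output-row tiles
            for q0 in range(0, Q, Tq):          # output-column tiles
                for c0 in range(0, C, Tc):      # input-channel tiles
                    for m in range(m0, min(m0 + Tm, M)):
                        for c in range(c0, min(c0 + Tc, C)):
                            for r in range(r0, min(r0 + Tr, R)):
                                for q in range(q0, min(q0 + Tq, Q)):
                                    for i in range(K):
                                        for j in range(K):
                                            out[m][r][q] += x[c][r + i][q + j] * w[m][c][i][j]
    return out
```

Because the quantized operands are integers, the tiled and direct loop nests produce identical results for any tile sizes; tiling changes only the order of the work and therefore which data can be reused between consecutive steps.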

Keywords: convolution acceleration; data reuse; parallel computing; multi-core expansion; hardware/software co-design

Classification number: TP332 [Automation and Computer Technology: Computer System Architecture]

 
