SAF-CNN: A Sparse Acceleration Framework of Convolutional Neural Network for Embedded FPGAs (Cited by: 2)


Authors: Xie Kunpeng; Yi Dezhi; Liu Yiqing; Liu Hang; He Xinyu; Gong Cheng; Lu Ye

Affiliations: [1] College of Computer Science, Nankai University, Tianjin 300350; [2] College of Cyber Science, Nankai University, Tianjin 300350; [3] College of Software, Nankai University, Tianjin 300350; [4] Tianjin Key Laboratory of Network and Data Security Technology (Nankai University), Tianjin 300350; [5] State Key Lab of Processors (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2023, No. 5, pp. 1053-1072 (20 pages)

Funding: National Natural Science Foundation of China (62002175); Open Project of the State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences) (CARCHB202016); Tianjin Outstanding Science and Technology Commissioner Project for Enterprises (21YDTPJC00380); Open Fund of the Information Security Evaluation Center, Civil Aviation University of China (ISECCA-202102); CCF-Huawei Populus Grove Fund (CCF-HuaweiTC2022005).

Abstract: When deploying models on resource-constrained FPGAs, traditional convolutional neural network accelerators and inference frameworks often face challenges such as a wide variety of device types with extremely limited resources, under-utilized data bandwidth, and complex operator types that are difficult to adapt and schedule reasonably. This paper proposes SAF-CNN, a sparse acceleration framework of convolutional neural network for embedded FPGAs. Through software-hardware co-design, SAF-CNN is jointly optimized from the two perspectives of the hardware accelerator and the software inference framework. First, SAF-CNN constructs a parallel computing array and designs a parallel encoding and decoding scheme to realize single-cycle multi-data transmission, effectively reducing communication costs. Second, a fine-grained structured block-partitioning pruning algorithm is designed to obtain a sparse yet regular weight matrix by pruning within blocks along the input-channel dimension, significantly reducing the computation scale and the usage of resources such as DSP multipliers. Then, a dynamic input-channel expansion and runtime scheduling strategy compatible with depthwise-separable convolution is proposed to enable flexible adaptation of input-channel parameters and resource reuse between depthwise and pointwise convolutions. Finally, a computational-graph reconstruction and hardware operator-fusion optimization method is proposed to improve hardware execution efficiency. Experiments on two resource-limited low-end FPGA heterogeneous platforms, Intel CycloneV and Xilinx ZU3EG, show that the SAF-CNN accelerator achieves computational performance of 76.3 GOPS and 494.3 GOPS, respectively. Compared with a multi-core CPU, SAF-CNN achieves 3.5x and 2.2x performance improvements on the SSD_MobileNetV1 object detection model, with inference speed up to 26.5 fps.
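The "fine-grained structured block partitioning pruning" step in the abstract can be illustrated with a minimal NumPy sketch. This is only an illustration of the general idea — pruning within fixed-size blocks along the input-channel axis so that every block retains the same number of channels, yielding a sparse but regular pattern — and not the paper's exact algorithm; the function name, block size, and the L1-norm selection criterion are assumptions.

```python
import numpy as np

def block_prune_input_channels(weight, block_size=8, keep=4):
    """Prune a conv weight within blocks along the input-channel axis.

    weight: array of shape (C_out, C_in, kH, kW).
    For each output filter, input channels are split into blocks of
    `block_size`; within each block only the `keep` channels with the
    largest L1 norm survive, the rest are zeroed. Every block then has
    exactly `keep` nonzero channels, so the sparsity is regular and
    hardware-friendly.
    """
    c_out, c_in, kh, kw = weight.shape
    assert c_in % block_size == 0, "C_in must be a multiple of block_size"
    pruned = weight.copy()
    mask = np.zeros((c_out, c_in), dtype=bool)
    for o in range(c_out):
        for b in range(0, c_in, block_size):
            # L1 norm of each input channel's kernel slice within the block
            norms = np.abs(weight[o, b:b + block_size]).sum(axis=(1, 2))
            top = np.argsort(norms)[-keep:]   # indices of channels to keep
            mask[o, b + top] = True
    pruned[~mask] = 0.0
    return pruned, mask
```

With `block_size=8` and `keep=4` this gives 50% sparsity while guaranteeing a uniform number of multiplications per block, which is what allows a fixed-size multiplier array to stay fully utilized.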

Keywords: convolutional neural network; model compression; computational graph; accelerator design; inference framework

Classification: TP391 [Automation and Computer Technology — Computer Application Technology]
