Algorithm Compression and Hardware Design Co-Optimization for 3D-CNN  Cited by: 2


Authors: QIAN Jiaming; LOU Wenqi; GONG Lei; WANG Chao [1,2]; ZHOU Xuehai (School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, China)

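Affiliations: [1] School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; [2] Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, China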

Source: Computer Engineering and Applications, 2023, No. 18, pp. 74-83 (10 pages)

Funding: Science and Technology Project of State Grid Corporation of China Headquarters (5700-202119266A-0-0-00).

Abstract: In recent years, 3D convolutional neural networks (3D-CNNs) have attracted significant attention for their excellent performance in video classification. However, compared with 2D-CNNs, the substantially larger computing and storage requirements of 3D-CNNs inevitably lead to performance and energy-efficiency problems during deployment, which severely limits their applicability in scenarios with limited hardware resources. To tackle this challenge, this paper proposes 3D FCirCNN, an algorithm-hardware co-design and optimization method for deploying 3D-CNNs efficiently. At the algorithm level, 3D FCirCNN is the first to compress 3D-CNNs with block circulant matrices, and it further accelerates them with the fast Fourier transform (FFT), significantly reducing the computation and storage overhead of the model while preserving a regular network structure. On this basis, 3D FCirCNN introduces activation, batch normalization, and pooling operations in the frequency domain; performing inference entirely in the frequency domain eliminates the time-domain/frequency-domain switching overhead caused by the FFT. At the hardware level, a dedicated accelerator architecture is designed for the block-circulant-compressed 3D-CNN, along with a series of optimizations targeting hardware resources and memory bandwidth. Experiments on a Xilinx ZCU102 FPGA show that, compared with the previous state-of-the-art work, 3D FCirCNN achieves a 16.68x performance improvement and a 16.18x computational-efficiency improvement within an acceptable accuracy loss (<2%).
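The block-circulant compression and FFT acceleration described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it shows only a plain matrix-vector product (not a 3D convolution or the FPGA architecture), and it assumes each b×b circulant block is defined by its first column, that b is a power of two, and that all function names are hypothetical. The point is that each block stores b values instead of b², and that the product reduces to element-wise multiplication in the frequency domain.

```python
import cmath


def fft(x, inverse=False):
    """Unnormalized recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(x):
    """Inverse FFT with the 1/n normalization applied."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def circulant_matvec_dense(w, x):
    """Reference: y = C @ x with C[i][j] = w[(i - j) % b] (w is the first column)."""
    b = len(w)
    return [sum(w[(i - j) % b] * x[j] for j in range(b)) for i in range(b)]


def circulant_matvec_fft(w, x):
    """Same product via the convolution theorem: IFFT(FFT(w) * FFT(x))."""
    prod = [a * c for a, c in zip(fft(w), fft(x))]
    return [v.real for v in ifft(prod)]


def block_circulant_matvec(blocks, x, b):
    """blocks[p][q] is the length-b defining vector of circulant block (p, q)."""
    cols = len(blocks[0])
    # Transform each input sub-vector once and reuse it for every block row.
    x_freq = [fft(x[q * b:(q + 1) * b]) for q in range(cols)]
    y = []
    for row in blocks:
        acc = [0j] * b
        for q, w in enumerate(row):
            w_freq = fft(w)  # in practice this could be precomputed offline
            acc = [a + wf * xf for a, wf, xf in zip(acc, w_freq, x_freq[q])]
        # Accumulate in the frequency domain, then one IFFT per output block.
        y.extend(v.real for v in ifft(acc))
    return y
```

For example, `block_circulant_matvec(blocks, x, 4)` with a 2×2 grid of blocks multiplies an 8×8 weight matrix while storing only 16 weights. Accumulating in the frequency domain means one inverse transform per output block rather than one per weight block, which is in the same spirit as the abstract's aim of reducing time-domain/frequency-domain switching.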

Keywords: 3D convolutional neural network; circulant matrix; full frequency domain; field-programmable gate array (FPGA)

Classification number: TP302.1 [Automation and Computer Technology: Computer System Architecture]

 
