Authors: QIAN Jiaming (钱佳明), LOU Wenqi (娄文启), GONG Lei (宫磊), WANG Chao (王超) [1,2], ZHOU Xuehai (周学海)
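Affiliations: [1] School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; [2] Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, China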
Source: Computer Engineering and Applications (《计算机工程与应用》), 2023, No. 18, pp. 74-83 (10 pages)
Funding: Science and Technology Project of State Grid Corporation of China Headquarters (5700-202119266A-0-0-00).
Abstract: In recent years, 3D convolutional neural networks (3D-CNN) have attracted significant attention due to their excellent performance in video classification. However, compared with 2D-CNN, the much larger computing and storage requirements of 3D-CNN inevitably lead to performance and energy-efficiency problems during deployment, which severely limits their applicability in scenarios with limited hardware resources. To tackle this challenge, this paper proposes 3D FCirCNN, an algorithm-hardware co-design and optimization method for deploying 3D-CNN efficiently. At the algorithm level, 3D FCirCNN is the first to compress 3D-CNN with block-circulant matrices, and it further accelerates the compressed model with the fast Fourier transform (FFT), significantly reducing the model's computation and storage overhead while keeping the network structure regular. On this basis, 3D FCirCNN introduces activation, batch normalization, and pooling operations in the frequency domain, so that inference runs entirely in the frequency domain and the time-domain/frequency-domain switching overhead caused by the FFT is eliminated. At the hardware design level, 3D FCirCNN provides a dedicated hardware accelerator architecture for the block-circulant-compressed 3D-CNN, together with a series of optimizations targeting hardware resources and memory bandwidth. Experiments on a Xilinx ZCU102 FPGA show that, compared with the previous state-of-the-art work, 3D FCirCNN achieves a 16.68x performance improvement and a 16.18x computational-efficiency improvement within an acceptable accuracy loss (<2%).
Keywords: 3D convolutional neural network; circulant matrix; full frequency domain; field-programmable gate array (FPGA)
Classification code: TP302.1 [Automation and Computer Technology: Computer System Architecture]
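Illustrative note: the abstract's core algorithmic idea, block-circulant compression accelerated by the FFT, can be sketched on a plain matrix-vector product. The sketch below is not the paper's 3D-CNN implementation; it assumes a weight matrix split into p x q circulant blocks of size b x b, each stored as a single length-b vector, and all function names (`circulant`, `block_circulant_matvec`, `dense_from_blocks`) are hypothetical. Under those assumptions, each block product becomes an element-wise multiplication in the frequency domain.

```python
import numpy as np

def circulant(c):
    """Dense b x b circulant matrix whose first column is c (for checking only)."""
    b = len(c)
    return np.stack([np.roll(c, k) for k in range(b)], axis=1)

def dense_from_blocks(w):
    """Expand defining vectors w of shape (p, q, b) into the full (p*b, q*b) matrix."""
    p, q, b = w.shape
    return np.block([[circulant(w[i, j]) for j in range(q)] for i in range(p)])

def block_circulant_matvec(w, x):
    """Multiply the block-circulant matrix defined by w with x using FFTs.

    w: (p, q, b) array, one defining vector per circulant block (assumed layout).
    x: (q*b,) input vector. Returns a (p*b,) output vector.
    """
    p, q, b = w.shape
    xf = np.fft.fft(x.reshape(q, b), axis=-1)   # FFT of each input segment
    wf = np.fft.fft(w, axis=-1)                 # FFT of each defining vector
    yf = np.einsum("pqb,qb->pb", wf, xf)        # element-wise product, summed over q
    return np.fft.ifft(yf, axis=-1).real.reshape(p * b)

rng = np.random.default_rng(0)
w = rng.standard_normal((2, 3, 4))   # p=2, q=3, b=4: 24 stored values
x = rng.standard_normal(12)
y_fft = block_circulant_matvec(w, x)
y_dense = dense_from_blocks(w) @ x   # the equivalent dense matrix has 96 entries
```

In this toy setting the stored parameter count shrinks by a factor of b, because only one vector per block is kept. The fully frequency-domain activation, batch normalization, and pooling described in the abstract are not covered by this sketch.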