A Heterogeneous Framework to Accelerate CNNs with Fine-Grained FPGA Management


Authors: GUO Kai-Cheng, WU Cheng-Gang, ZHANG Wei-Feng, QI Zheng-Wei, GUAN Hai-Bing (School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240; Alibaba Group, Hangzhou 311121)

Affiliations: [1] School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; [2] Alibaba Group, Hangzhou 311121, China

Source: Chinese Journal of Computers, 2021, Issue 12, pp. 2529-2541 (13 pages)

Funding: Supported by the National Natural Science Foundation of China (61672344, 61525204, 61732010), the National Key Research and Development Program of China (2016YFB1000502), and the Alibaba Innovative Research (AIR) Program.

Abstract: In recent years, field-programmable gate arrays (FPGAs) have attracted wide attention in research on, and applications of, hardware-accelerated convolutional neural networks (CNNs), owing to their flexible customizability and excellent parallelism. In particular, FPGA-based accelerators achieve excellent performance on specific network structures and in low-precision inference scenarios through hardware-level customization. Existing work has focused on two main directions: the design and optimization of acceleration modules for a specific network, and the design of generic acceleration hardware for a class of network models. The former is generally a dataflow-based design for a fixed network, sacrificing generality for performance; the latter is generally an instruction-set-based design capable of accelerating a class of models, sacrificing performance for generality. An accelerator with a stream-based architecture accelerates CNN models by instantiating several processing units (PUs); usually, each PU provides only one specific function, such as computing a convolution layer. Compared with an instruction-based architecture, a stream-based architecture is more amenable to network-dependent optimizations. In an instruction-based architecture, the hardware accelerator comes with an instruction set: the software compiles the CNN model into a sequence of instructions, which the accelerator then executes. Flexibility thus comes from the software rather than from the FPGA configuration, so the reconfigurability of FPGAs is not fully exploited. To respond flexibly to different requirements, this paper proposes fGrain, a framework that balances performance and generality by managing operators of different granularities. The design of fGrain follows modern machine learning systems: it combines hardware accelerator design with a software framework, both built around dataflow graphs. fGrain abstracts FPGA hardware resources as operators and manages FPGA resources at operator granularity within the software machine learning framework; a virtualization layer manages the mapping of operators to provide flexibility, while the underlying dataflow-based operator designs fully exploit hardware performance. Experiments show up to a 25% improvement in inference latency over a GPU, while the performance loss due to virtualization is below 1.3%.
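The abstract describes a virtualization layer that decides, per operator, whether a CNN dataflow-graph node maps onto a specialized FPGA dataflow kernel or falls back to a generic path. A minimal sketch of that placement idea, with entirely hypothetical names (this is not the actual fGrain API):

```python
# Hypothetical sketch of operator-granularity placement: a CNN is a
# dataflow graph of operators, and a toy "virtualization layer" maps
# each operator to an FPGA dataflow kernel when one exists, otherwise
# to a generic fallback. All class and kernel names are illustrative.

from dataclasses import dataclass, field


@dataclass
class Operator:
    name: str                      # e.g. "conv1"
    kind: str                      # e.g. "conv", "pool", "fc"
    inputs: list = field(default_factory=list)  # upstream operator names


class VirtualFPGA:
    """Toy virtualization layer: tracks which operator kinds have a
    specialized dataflow kernel available on the FPGA fabric."""

    def __init__(self, dataflow_kernels):
        self.dataflow_kernels = set(dataflow_kernels)

    def place(self, graph):
        """Map each operator to 'fpga-dataflow' if a specialized kernel
        exists for its kind, else to 'generic-fallback'."""
        return {
            op.name: ("fpga-dataflow" if op.kind in self.dataflow_kernels
                      else "generic-fallback")
            for op in graph
        }


# A tiny CNN expressed as a dataflow graph (topologically ordered).
graph = [
    Operator("conv1", "conv"),
    Operator("pool1", "pool", ["conv1"]),
    Operator("conv2", "conv", ["pool1"]),
    Operator("fc1",   "fc",   ["conv2"]),
]

# Only conv and pool have dataflow kernels; fc falls back.
vfpga = VirtualFPGA(dataflow_kernels={"conv", "pool"})
placement = vfpga.place(graph)
print(placement)
# -> {'conv1': 'fpga-dataflow', 'pool1': 'fpga-dataflow',
#     'conv2': 'fpga-dataflow', 'fc1': 'generic-fallback'}
```

The point of the sketch is the separation of concerns the paper claims: per-operator dataflow kernels carry the performance, while a thin software layer above them supplies the generality by choosing the mapping.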

Keywords: convolutional neural network; field-programmable gate array; machine learning system; user-space virtualization; open programming language

Classification: TP18 [Automation and Computer Technology / Control Theory and Control Engineering]

 
