Authors: GUO Kai-Cheng; WU Cheng-Gang; ZHANG Wei-Feng; QI Zheng-Wei; GUAN Hai-Bing
Affiliations: [1] School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; [2] Alibaba Group, Hangzhou 311121, China
Source: Chinese Journal of Computers (《计算机学报》), 2021, No. 12, pp. 2529-2541 (13 pages)
Funding: National Natural Science Foundation of China (61672344, 61525204, 61732010); National Key R&D Program of China (2016YFB1000502); Alibaba Innovative Research (AIR) program.
Abstract: In recent years, field-programmable gate arrays (FPGAs) have attracted wide attention in research and applications of hardware-accelerated convolutional neural networks (CNNs) due to their flexible customizability and excellent parallelism. In particular, FPGA-based accelerators achieve excellent performance on specific network structures and in low-precision inference scenarios by customizing at the hardware level. Existing work has focused on two main areas: the design and optimization of specific hardware acceleration modules, and the design of generic acceleration hardware for a class of network models. The former is generally a dataflow-based design for a fixed network, sacrificing generality for performance; the latter is generally an instruction-set-based design capable of accelerating a class of models, sacrificing performance for generality. An accelerator under a stream-based architecture accelerates CNN models by instantiating several processing units (PUs); usually, each PU provides only a specific functionality, such as computing a convolution layer. Compared with an instruction-based architecture, a stream-based architecture is more amenable to network-dependent optimizations. In an instruction-based architecture, the hardware accelerator comes with an instruction set: the software compiles the CNN model into a sequence of instructions, which the accelerator then executes. Flexibility thus comes from software rather than from the FPGA firmware, so the reconfigurability of FPGAs is not fully utilized. To respond flexibly to different requirements, this paper proposes fGrain, a framework that balances performance and generality by managing operators of different granularities. The design of fGrain follows modern machine learning systems: it combines hardware accelerator design with a software framework, both based on dataflow-graph design. fGrain abstracts FPGA hardware resources as operators and manages FPGA resources at operator granularity in the software machine learning framework, using low-level dataflow-based operator designs to exploit hardware performance while a virtualization layer manages operator mapping to provide flexibility. Experiments show that, compared with GPU inference, latency improves by up to 25%, while the performance loss introduced by virtualization is below 1.3%.
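The abstract's core idea, balancing performance and generality by managing operators of different granularities, can be loosely pictured as a placement problem: operators that fit into the FPGA's fabric are placed as dedicated dataflow units, and the rest fall back to a generic instruction-style executor. The sketch below is purely illustrative and is not fGrain's actual interface; every name (`Operator`, `map_operators`) and the resource costs are assumptions:

```python
# Illustrative sketch only: all names and area costs are hypothetical,
# not fGrain's real API. It shows the placement intuition, not the system.
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    area: int  # hypothetical FPGA resource cost (e.g., in LUT units)

def map_operators(ops, total_area):
    """Greedily place operators as dedicated dataflow units; operators
    that do not fit fall back to a generic instruction-style executor."""
    placed, fallback, used = [], [], 0
    for op in ops:
        if used + op.area <= total_area:
            placed.append(op.name)
            used += op.area
        else:
            fallback.append(op.name)
    return placed, fallback

model = [Operator("conv1", 40), Operator("relu1", 5),
         Operator("conv2", 40), Operator("fc", 30)]
placed, fallback = map_operators(model, total_area=90)
```

With these made-up costs, `conv1`, `relu1`, and `conv2` are placed (85 of 90 units used) and `fc` falls back, mirroring the paper's trade-off: placed operators get dataflow-level performance, while the fallback path preserves generality.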
Keywords: convolutional neural network; field-programmable gate array; machine learning system; user-space virtualization; open programming language
Classification: TP18 [Automation & Computer Technology — Control Theory and Control Engineering]