Authors: CHEN Xuhao (陈煦豪); HU Sipeng (胡思鹏); LIU Hongchao (刘洪超); LIU Boran (刘伯然); TANG Dan (唐丹); ZHAO Di (赵地)
Affiliations: [1] Beijing Institute of Open Source Chip, Beijing 100080, China [2] School of Information Science and Technology, ShanghaiTech University, Shanghai 210210, China [3] Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450003, China [4] State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China [5] University of Chinese Academy of Sciences, Beijing 100049, China
Source: Computer Science (《计算机科学》), 2025, No. 5, pp. 83-90 (8 pages)
Funding: Strategic Priority Research Program of the Chinese Academy of Sciences (XDA0320300).
Abstract: To meet the high-performance and low-power requirements of edge AI, this paper designs an application-specific instruction set processor for edge AI based on the RISC-V instruction set architecture, targeting practical problems in digital signal processing on edge devices. With limited hardware overhead, the design improves the execution efficiency of edge AI and reduces its energy consumption, meeting the demand for efficient large language model (LLM) inference in edge AI applications. To suit the characteristics of LLMs, the RISC-V instruction set is extended with custom instructions that perform vector dot product computation, so that LLM operations are accelerated on dedicated vector dot product hardware. Based on the Nanhu version of the open-source high-performance RISC-V processor core XiangShan, the vector dot product processor Nanhu-vdot is implemented by adding vector dot product computation units and pipeline processing logic to XiangShan (Nanhu). In FPGA hardware tests, with almost no additional hardware resources or power consumption, Nanhu-vdot computes vector dot products more than four times faster than the scalar method. Using a hardware-software co-design approach for GPT-2 (second-generation Generative Pre-trained Transformer) inference, speed improves by about 30% over a pure software implementation.
Keywords: instruction set extension; vector dot product; hardware-software co-design; large language model inference
Classification number: TP302 [Automation and Computer Technology - Computer System Architecture]