Research on LLM Vector Dot Product Acceleration Based on RISC-V Matrix Instruction Set Extension


Authors: CHEN Xuhao; HU Sipeng; LIU Hongchao; LIU Boran; TANG Dan; ZHAO Di

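Affiliations: [1] Beijing Institute of Open Source Chip, Beijing 100080, China; [2] School of Information Science and Technology, ShanghaiTech University, Shanghai 210210, China; [3] Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450003, China; [4] State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; [5] University of Chinese Academy of Sciences, Beijing 100049, China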

Source: Computer Science (《计算机科学》), 2025, No. 5, pp. 83-90 (8 pages)

Funding: Strategic Priority Research Program of the Chinese Academy of Sciences (XDA0320300).

Abstract: To meet the high-performance and low-power requirements of edge AI, this paper designs an application-specific instruction set processor for edge AI based on the RISC-V instruction set architecture, targeting practical digital signal processing problems on edge devices. With limited hardware overhead, the design improves the execution efficiency of edge AI and reduces its energy consumption, meeting the demand for efficient large language model (LLM) inference in edge AI applications. To suit the characteristics of LLMs, custom instructions are added to the RISC-V instruction set to perform vector dot product computation, so that LLM operations are accelerated on dedicated vector dot product hardware. Based on the Nanhu version of the open-source high-performance RISC-V processor core XiangShan, the paper implements Nanhu-vdot, a processor with a dedicated vector dot product instruction set, which adds vector dot product computation units and pipeline processing logic to XiangShan Nanhu. In FPGA hardware tests, Nanhu-vdot computes vector dot products more than four times as fast as the scalar method, with almost no additional hardware resources or power consumption. When a hardware-software co-design approach is used for second-generation Generative Pre-trained Transformer (GPT-2) model inference, speed improves by approximately 30% compared with a pure software implementation.
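The abstract does not give the encoding or operand format of the custom instruction, so the following is only a minimal software sketch of the general idea: a scalar dot product issues one multiply-accumulate per element, whereas a packed dot-product instruction consumes several elements per issue. The name `vdot`, the signed 8-bit lane width, the eight lanes per 64-bit register, and the helper functions are all assumptions made for illustration and are not taken from the paper.

```python
# Software model of a HYPOTHETICAL packed dot-product instruction.
# Lane width (int8), lane count (8) and register size (64 bits) are
# assumptions for illustration; the abstract does not specify them.

LANES = 8       # assumed: eight lanes per 64-bit register
LANE_BITS = 8   # assumed: signed 8-bit elements


def pack(values):
    """Pack LANES signed 8-bit integers into one 64-bit register image."""
    assert len(values) == LANES
    reg = 0
    for i, v in enumerate(values):
        assert -128 <= v <= 127
        reg |= (v & 0xFF) << (i * LANE_BITS)
    return reg


def lane(reg, i):
    """Extract lane i from a register image as a signed 8-bit integer."""
    raw = (reg >> (i * LANE_BITS)) & 0xFF
    return raw - 256 if raw >= 128 else raw


def vdot(acc, rs1, rs2):
    """Model one hypothetical vdot issue: acc + sum(rs1[i] * rs2[i])."""
    return acc + sum(lane(rs1, i) * lane(rs2, i) for i in range(LANES))


def dot_scalar(a, b):
    """Baseline: one multiply-accumulate per element."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_packed(a, b):
    """Same result, using one modeled vdot issue per LANES elements."""
    assert len(a) == len(b) and len(a) % LANES == 0
    acc = 0
    for i in range(0, len(a), LANES):
        acc = vdot(acc, pack(a[i:i + LANES]), pack(b[i:i + LANES]))
    return acc
```

In this model, `dot_packed` issues one `vdot` per eight elements where `dot_scalar` performs eight separate multiply-accumulates. The model only checks functional equivalence; it says nothing about the timing, accumulator width, or speedup of the actual Nanhu-vdot hardware.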

Keywords: instruction set extension; vector dot product; hardware-software co-design; large language model inference

Classification: TP302 [Automation and Computer Technology: Computer System Architecture]

 
