SDAA: Runtime System for Shenwei AI Acceleration Card

Authors: ZHAO Yu-Long; ZHANG Lu-Fei; XU Guo-Chun; LI Yu-Xuan; SUN Ru-Jun; LIU Xin

Affiliations: [1] State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, Jiangsu, China; [2] Wuxi Institute of Advanced Technology, Wuxi 214125, Jiangsu, China; [3] Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; [4] National Research Center of Parallel Computer Engineering and Technology, Beijing 100083, China

Source: Journal of Software (软件学报), 2024, No. 12, pp. 5710-5724 (15 pages)

Funding: National Key Research and Development Program of China (2018ZX01028102).

Abstract: The domestically developed Shenwei AI acceleration card carries a Shenwei many-core processor enhanced with a systolic array. Its AI computing capability is comparable to that of mainstream GPUs, but it still lacks supporting basic software. To lower the barrier to using the card and to support AI application development effectively, this study designs SDAA, a runtime system for the Shenwei AI acceleration card whose semantics are consistent with the mainstream CUDA runtime. For the critical paths of memory management, data transfer, and kernel launch, a hardware-software co-design approach is adopted to implement four mechanisms: a multi-level on-card memory allocation algorithm that combines segmentation and paging, a multi-threaded, multi-channel transfer model for pageable memory, an adaptive data transfer algorithm across multiple heterogeneous components, and a fast kernel launch method based on on-chip array communication. As a result, the SDAA runtime outperforms that of mainstream GPUs. Experimental results show that, relative to the corresponding interfaces on the NVIDIA V100, SDAA allocates memory 120 times faster, incurs half the data transfer overhead, reaches 1.7 times the data transfer bandwidth, and has a comparable kernel launch time. The SDAA runtime already supports the efficient execution of mainstream frameworks and real model training on the Shenwei AI acceleration card.

Keywords: runtime system; Shenwei AI acceleration card; artificial intelligence; software-defined

Classification: TP303 [Automation and Computer Technology: Computer System Architecture]
