Authors: ZHAO Yu-Long (赵玉龙); ZHANG Lu-Fei (张鲁飞); XU Guo-Chun (许国春); LI Yu-Xuan (李宇轩); SUN Ru-Jun (孙茹君); LIU Xin (刘鑫)
Affiliations: [1] State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125, China [2] Wuxi Institute of Advanced Technology, Wuxi, Jiangsu 214125, China [3] Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China [4] National Research Center of Parallel Computer Engineering and Technology, Beijing 100083, China
Source: Journal of Software (《软件学报》), 2024, No. 12, pp. 5710-5724 (15 pages)
Funding: National Key Research and Development Program of China (2018ZX01028102).
Abstract: The homegrown Shenwei AI acceleration card carries a Shenwei many-core processor enhanced with a systolic array. Its AI computing capability is comparable to that of mainstream GPUs, but it still lacks supporting basic software. To lower the barrier to using the Shenwei AI acceleration card and to effectively support AI application development, this study designs SDAA, a runtime system for the card whose semantics are consistent with the mainstream CUDA runtime. For key paths such as memory management, data transfer, and kernel function launch, a software-hardware co-design approach is adopted to implement four techniques: an on-card multi-level memory allocation algorithm that combines segments and pages, a multi-thread, multi-channel transfer model for pageable memory, an adaptive data transfer algorithm across multiple heterogeneous components, and a fast kernel function launch method based on on-chip array communication. As a result, the runtime performance of SDAA is better than that of mainstream GPUs. Experimental results show that, compared with the corresponding interfaces on NVIDIA V100, the SDAA runtime system allocates memory 120 times as fast, incurs 1/2 of the data transfer overhead, reaches 1.7 times the data transfer bandwidth, and has a comparable kernel function launch time. The SDAA runtime already supports the efficient running of mainstream frameworks and real model training on the Shenwei AI acceleration card.
Classification: TP303 [Automation and Computer Technology: Computer System Architecture]