Deep Neural Network Operator Acceleration Library Optimization Based on Domestic Many-core Processor    (Cited by: 6)


Authors: GAO Jie, LIU Sha [2], HUANG Ze-qiang, ZHENG Tian-yu, LIU Xin [2], QI Feng-bin (1. School of Cyberspace Security, Information Engineering University, Zhengzhou 450000, China; 2. Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083, China; 3. School of Software, Shandong University, Jinan 250101, China)

Affiliations: [1] School of Cyberspace Security, Information Engineering University, Zhengzhou 450000, China; [2] Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083, China; [3] School of Software, Shandong University, Jinan 250101, China

Source: Computer Science, 2022, No. 5, pp. 355-362 (8 pages)

Funding: National Natural Science Foundation of China (U1806205).

Abstract: Operator acceleration libraries targeting different hardware devices have become an indispensable part of deep learning frameworks, providing severalfold speedups for large-scale training and inference tasks. Current mainstream operator libraries are developed for GPU architectures and are not compatible with other heterogeneous designs. The SWDNN operator library, developed for the SW26010 processor, can neither fully exploit the performance of the upgraded SW26010 pro processor nor satisfy the large memory capacity and high memory access bandwidth required by current large neural network models such as GPT-3. Targeting the architectural characteristics of the SW26010 pro processor and the training requirements of large neural network models, this paper proposes a three-level parallelization and neural network operator task scheduling scheme based on multiple core groups, which satisfies the memory requirements of large-model training while improving parallel efficiency and overall computing performance. It further proposes a memory access optimization method combining a three-level asynchronous pipeline with overlap of computation and memory access, which significantly alleviates the memory access bottleneck of neural network operators. Based on these methods, the SWTensor multi-core-group operator acceleration library is built on the SW26010 pro processor. Experiments on the natural language processing model GPT-2 show that typical compute-intensive and memory-access-intensive operators in SWTensor reach 90.4% and 88.7% of the theoretical peak single-precision floating-point performance and memory access bandwidth, respectively.
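The abstract describes overlapping computation with memory access through an asynchronous pipeline and double buffering. Below is a minimal, generic sketch of the double-buffering idea only, written in standard C++ for illustration; it does not use the SWTensor library or the SW26010 pro athread/DMA interfaces, which are not given in this record, and std::async merely stands in for an asynchronous copy engine. The names kTile, async_load and axpy_double_buffered are hypothetical.

// Generic sketch of double buffering: while the current tile is being
// processed, the next tile is loaded into the idle buffer in the background.
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical tile size; on real accelerator hardware this would be bounded
// by the size of the per-core local store.
constexpr std::size_t kTile = 4096;

// Stand-in for an asynchronous "get": copy one tile from main memory (src)
// into a local buffer (dst) on a background thread.
std::future<void> async_load(const float* src, float* dst, std::size_t n) {
    return std::async(std::launch::async,
                      [=] { std::copy(src, src + n, dst); });
}

// Example kernel: y[i] += a * x[i], processed tile by tile with two local
// buffers so that loading tile t+1 overlaps with computing on tile t.
void axpy_double_buffered(float a, const float* x, float* y, std::size_t n) {
    std::vector<float> buf[2] = {std::vector<float>(kTile),
                                 std::vector<float>(kTile)};
    const std::size_t ntiles = (n + kTile - 1) / kTile;
    auto tile_len = [&](std::size_t t) { return std::min(kTile, n - t * kTile); };

    // Prime the pipeline: start loading tile 0.
    std::future<void> pending = async_load(x, buf[0].data(), tile_len(0));

    for (std::size_t t = 0; t < ntiles; ++t) {
        pending.wait();                        // tile t is now resident locally
        if (t + 1 < ntiles)                    // kick off the load of tile t+1
            pending = async_load(x + (t + 1) * kTile,
                                 buf[(t + 1) % 2].data(), tile_len(t + 1));

        // Compute on tile t while tile t+1 is (conceptually) in flight.
        const float* xt = buf[t % 2].data();
        float* yt = y + t * kTile;
        for (std::size_t i = 0, len = tile_len(t); i < len; ++i)
            yt[i] += a * xt[i];
    }
}

On the target hardware the same loop structure would presumably issue a DMA transfer for the next tile into the idle local-store buffer while the compute cores work on the current one, which is the computation/memory-access overlap the abstract refers to.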

Keywords: deep neural network; operator acceleration library; load balancing; asynchronous pipeline; double buffering

CLC Number: TP311 [Automation and Computer Technology - Computer Software and Theory]

 
