Many-core Optimization of Level 1 and Level 2 BLAS Routines on SW26010-Pro

Original title: 面向SW26010-Pro的1、2级BLAS函数众核并行优化技术

Cited by: 2

Authors: HU Yi (胡怡); CHEN Dao-Kun (陈道琨); YANG Chao (杨超); LIU Fang-Fang (刘芳芳) [1,2]; MA Wen-Jing (马文静); YIN Wan-Wang (尹万旺) [4]; YUAN Xin-Hui (袁欣辉); LIN Rong-Fen (林蓉芬)

Affiliations: [1] Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; [2] University of Chinese Academy of Sciences, Beijing 100049, China; [3] School of Mathematical Sciences, Peking University, Beijing 100871, China; [4] National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China

Source: Journal of Software (软件学报), 2023, No. 9, pp. 4421-4436 (16 pages)

Funding: National Key Research and Development Program of China (2020YFB0204601).

Abstract: BLAS (Basic Linear Algebra Subprograms) is an important module of high-performance extended math libraries and is widely used in scientific and engineering computing. Level 1 BLAS provides vector-vector operations, and Level 2 BLAS provides matrix-vector operations. This study designs and implements high-performance Level 1 and Level 2 BLAS routines for SW26010-Pro, a domestic Chinese many-core processor. A reduction strategy among CPEs (slave cores) is designed based on the RMA communication mechanism, which improves the reduction efficiency of several Level 1 and Level 2 BLAS routines. For routines with data dependencies, such as TRSV and TPSV, a set of efficient parallel algorithms is proposed. These algorithms maintain the data dependencies through point-to-point synchronization and use an efficient task mapping mechanism suited to triangular matrices, which effectively reduces the number of point-to-point synchronizations among CPEs and improves execution efficiency. Adaptive optimization, vector compression, data reuse, and other techniques further improve the memory access bandwidth utilization of the Level 1 and Level 2 BLAS routines. Experimental results show that the memory access bandwidth utilization of the Level 1 BLAS routines reaches up to 95% and averages above 90%, while that of the Level 2 BLAS routines reaches up to 98% and averages above 80%. Compared with the widely used open-source math library GotoBLAS, the proposed Level 1 and Level 2 BLAS routines achieve average speedups of 18.78 times and 25.96 times, respectively. By calling the proposed routines, LU decomposition, QR decomposition, and symmetric eigenvalue problems achieve an average speedup of 10.99 times.
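To illustrate the data dependency in TRSV that the abstract mentions, here is a minimal sketch of a generic blocked forward substitution for a lower-triangular system. It is not the algorithm proposed in the paper, and it does not model CPEs, RMA communication, or the task mapping described there. The function name `trsv_lower_blocked` and the `block` parameter are hypothetical and chosen only for this example.

```python
def trsv_lower_blocked(L, b, block=2):
    """Solve L x = b for a lower-triangular matrix L (list of rows).

    Generic blocked forward substitution, shown only to make the
    dependency structure visible. Hypothetical helper, not the paper's code.
    """
    n = len(b)
    x = list(b)  # copy of b; overwritten step by step with the solution
    for start in range(0, n, block):
        end = min(start + block, n)
        # Step 1 (dependent): solve the diagonal block. Each x[i] needs
        # every earlier x[j], so this part must run in order.
        for i in range(start, end):
            for j in range(start, i):
                x[i] -= L[i][j] * x[j]
            x[i] /= L[i][i]
        # Step 2 (independent per row): subtract the contribution of the
        # newly solved entries from all rows below. In a parallel version,
        # workers would have to wait for Step 1 before starting this update,
        # which is where a synchronization would be placed.
        for i in range(end, n):
            for j in range(start, end):
                x[i] -= L[i][j] * x[j]
    return x
```

The rows updated in Step 2 do not depend on each other, so they could be distributed across cores, while Step 1 forces an ordering between consecutive blocks. Reducing how often cores must wait on that ordering is the kind of cost the abstract's point-to-point synchronization scheme is aimed at.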

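Keywords: Level 1 BLAS; Level 2 BLAS; memory access bandwidth; SW26010-Pro many-core processor; RMA communication; point-to-point synchronization; adaptive optimization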

Classification: TP303 [Automation and Computer Technology: Computer System Architecture]
