Many-core Optimization of Level 1 and Level 2 BLAS Routines on SW26010-Pro


Authors: HU Yi; CHEN Dao-Kun; YANG Chao; LIU Fang-Fang [1,2]; MA Wen-Jing; YIN Wan-Wang [4]; YUAN Xin-Hui; LIN Rong-Fen

Affiliations: [1] Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; [2] University of Chinese Academy of Sciences, Beijing 100049, China; [3] School of Mathematical Sciences, Peking University, Beijing 100871, China; [4] National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China

Source: Journal of Software (软件学报), 2023, No. 9, pp. 4421-4436 (16 pages)

Funding: National Key Research and Development Program of China (2020YFB0204601).

Abstract: BLAS (basic linear algebra subprograms) is an important module of high-performance extended math libraries and is widely used in scientific and engineering computing. Level 1 BLAS provides vector-vector operations, and Level 2 BLAS provides matrix-vector operations. This study designs and implements high-performance Level 1 and Level 2 BLAS routines for SW26010-Pro, a domestic many-core processor. A reduction strategy among CPEs (the processor's slave cores) is designed on top of the RMA communication mechanism, which improves the reduction efficiency of several Level 1 and Level 2 BLAS routines. For routines with data dependencies, such as TRSV and TPSV, an efficient parallel algorithm is proposed. The algorithm maintains the data dependencies through point-to-point synchronization and uses a task mapping mechanism suited to triangular matrices, which reduces the number of point-to-point synchronizations among CPEs and improves execution efficiency. Adaptive optimization, vector compression, data reuse, and other techniques further improve the memory access bandwidth utilization of the Level 1 and Level 2 BLAS routines. Experimental results show that the memory access bandwidth utilization of the Level 1 BLAS routines reaches up to 95% and exceeds 90% on average, while that of the Level 2 BLAS routines reaches up to 98% and exceeds 80% on average. Compared with the widely used open-source math library GotoBLAS, the Level 1 and Level 2 BLAS routines achieve average speedups of 18.78 times and 25.96 times, respectively. LU decomposition, QR decomposition, and the symmetric eigenvalue problem achieve an average speedup of 10.99 times by calling the proposed high-performance Level 1 and Level 2 BLAS routines.
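The data dependency that the abstract mentions for TRSV can be illustrated with a small serial sketch of a blocked lower-triangular solve. This is not the authors' algorithm or their task mapping; the function name `trsv_lower_blocked` and the `block` parameter are hypothetical and chosen only for illustration. The point it shows is that each diagonal block must be solved in order, while the updates to the rows below a finished block are independent of one another. In a many-core setting, those independent updates are the work that could be spread across cores, with a synchronization marking when each diagonal block is done.

```python
def trsv_lower_blocked(L, b, block=2):
    """Solve L x = b for a lower-triangular L by block forward substitution.

    Illustrative serial sketch only (hypothetical helper, not the paper's code).
    L is a list of rows, b is a list of numbers.
    """
    n = len(b)
    x = [float(v) for v in b]  # work on a copy of the right-hand side
    for start in range(0, n, block):
        end = min(start + block, n)
        # Diagonal block: inherently sequential, x[i] needs every earlier x[j].
        for i in range(start, end):
            for j in range(start, i):
                x[i] -= L[i][j] * x[j]
            x[i] /= L[i][i]
        # Panel below the block: each row only reads the x values just solved,
        # so these row updates do not depend on each other.
        for i in range(end, n):
            for j in range(start, end):
                x[i] -= L[i][j] * x[j]
    return x
```

For example, calling `trsv_lower_blocked([[2, 0, 0], [1, 1, 0], [3, 2, 4]], [2, 3, 19])` solves a 3-by-3 system whose exact solution is x = (1, 2, 3).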

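Keywords: Level 1 BLAS; Level 2 BLAS; memory access bandwidth; SW26010-Pro many-core processor; RMA communication; point-to-point synchronization; adaptive optimization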

CLC number: TP303 [Automation and Computer Technology: Computer Systems Architecture]

 
