检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:胡怡 陈道琨 杨超 刘芳芳[1,2] 马文静 尹万旺[4] 袁欣辉 林蓉芬 HU Yi;CHEN Dao-Kun;YANG Chao;LIU Fang-Fang;MA Wen-Jing;YIN Wan-Wang;YUAN Xin-Hui;LIN Rong-Fen(Laboratory of Parallel Software and Computational Science,Institute of Software,Chinese Academy of Sciences,Beijing 100190,China;University of Chinese Academy of Sciences,Beijing 100049,China;School of Mathematical Sciences,Peking University,Beijing 100871,China;National Research Center of Parallel Computer Engineering and Technology,Beijing 100190,China)
机构地区:[1]中国科学院软件研究所并行软件与计算科学实验室,北京100190 [2]中国科学院大学,北京100049 [3]北京大学数学科学学院,北京100871 [4]国家并行计算机工程技术研究中心,北京100190
出 处:《软件学报》2023年第9期4421-4436,共16页Journal of Software
基 金:国家重点研发计划(2020YFB0204601)。
摘 要:BLAS (basic linear algebra subprograms)是高性能扩展数学库的一个重要模块,广泛应用于科学与工程计算领域. BLAS 1级提供向量-向量运算, BLAS 2级提供矩阵-向量运算.针对国产SW26010-Pro众核处理器设计并实现了高性能BLAS 1、2级函数.基于RMA通信机制设计了从核归约策略,提升了BLAS 1、2级若干函数的归约效率.针对TRSV、TPSV等存在数据依赖关系的函数,提出了一套高效并行算法,该算法通过点对点同步维持数据依赖关系,设计了适用于三角矩阵的高效任务映射机制,有效减少了从核点对点同步的次数,提高了函数的执行效率.通过自适应优化、向量压缩、数据复用等技术,进一步提升了BLAS 1、2级函数的访存带宽利用率.实验结果显示, BLAS 1级函数的访存带宽利用率最高可达95%,平均可达90%以上, BLAS 2级函数的访存带宽利用率最高可达98%,平均可达80%以上.与广泛使用的开源数学库GotoBLAS相比, BLAS 1、2级函数分别取得了平均18.78倍和25.96倍的加速效果. LU分解、QR分解以及对称特征值问题通过调用所提出的高性能BLAS 1、2级函数取得了平均10.99倍的加速效果.BLAS(basic linear algebra subprograms)is an important module of the high-performance extended math library,which is widely used in the field of scientific and engineering computing.Level 1 BLAS provides vector-vector operation,Level 2 BLAS provides matrix-vector operation.This study designs and implements highly optimized Level 1 and Level 2 BLAS routines for SW26010-Pro,a domestic many-core processor.A reduction strategy among CPEs is designed based on the RMA communication mechanism,which improves the reduction efficiency of many Level 1 and Level 2 BLAS routines.For TRSV and TPSV and other routines that have data dependencies,a series of efficient parallelization algorithms are proposed.The algorithm maintains data dependencies through point-topoint synchronization and designs an efficient task mapping mechanism that is suitable for triangular matrices,which reduces the number of point-to-point synchronizations effectively,and improves the execution efficiency.In this study,adaptive optimization,vector compression,data multiplexing,and other technologies have further improved the memory access bandwidth utilization of Level 1 and Level 2 BLAS routines.The experimental results show that the memory access bandwidth utilization rate of the Level 1 BLAS routines can reach as high as 95%,with an average bandwidth of more than 90%.The memory access bandwidth utilization rate of Level 2 BLAS routines can reach 98%,with an average bandwidth of more than 80%.Compared with the widely used open-source linear algebra library GotoBLAS,the proposed implementation of Level 1 and Level 2 BLAS routines achieved an average speedup of 18.78 times and 25.96 times.With the optimized Level 1 and Level 2 BLAS routines,LQ decomposition,QR decomposition,and eigenvalue problems achieved an average speedup of 10.99 times.
关 键 词:BLAS 1级 BLAS 2级 访存带宽 SW26010-Pro众核处理器 RMA通信 点对点同步 自适应优化
分 类 号:TP303[自动化与计算机技术—计算机系统结构]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:3.147.59.250