2.5D Parallel Matrix Multiplication Based on BLACS (Cited by: 1)

New 2.5D Parallel Matrix Multiplication Algorithm Based on BLACS


Authors: LIAO Xia; LI Sheng-Guo; LU Yu-Tong [2]; YANG Can-Qun [1,3] (College of Computer Science, National University of Defense Technology, Changsha 410073; National Supercomputer Center in Guangzhou, Sun Yat-Sen University, Guangzhou 510006; Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073)

Affiliations: [1] College of Computer Science, National University of Defense Technology, Changsha 410073; [2] National Supercomputer Center in Guangzhou, Sun Yat-Sen University, Guangzhou 510006; [3] Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073

Source: Chinese Journal of Computers, 2021, No. 5, pp. 1037-1050 (14 pages)

Funding: Key R&D Program of the Ministry of Science and Technology (2018YFB0204301); National Key R&D Program of China (2018YFB0204303); National Natural Science Foundation of China (61872392, U1811461); National Numerical Windtunnel Project (NNW2019ZT6-B20, NNW2019ZT6-B21, NNW2019ZT5-A10); Natural Science Foundation of Guangdong Province (2018B030312002); Guangdong "Pearl River Talent Plan" Introduced Innovative and Entrepreneurial Team (2016ZT06D211); Hunan Provincial Natural Science Foundation General Program (2019JJ40339); University Research Project (ZK18-03-01)

Abstract: Parallel matrix-matrix multiplication is one of the most important basic operations in linear algebra and a building block of many scientific applications. As high-performance computing (HPC) moves toward exascale, the communication cost of parallel matrix multiplication accounts for an ever larger share of the total running time. Reducing this communication cost and improving the scalability of parallel matrix multiplication are active research topics. This paper proposes a novel distributed parallel dense matrix multiplication algorithm, a 2.5D version of PUMMA (Parallel Universal Matrix Multiplication Algorithm). The underlying idea is to improve scalability by decreasing the data volume transferred between processes at the cost of extra memory usage: the algorithm divides the initial processes into c groups, and, exploiting the extra memory available on the compute nodes, stores matrices A and B on each process group and performs 1/c of the PUMMA algorithm there simultaneously; the final product is then obtained through a reduction operation. Based on the BLACS (Basic Linear Algebra Communication Subprograms) library, this paper implements a new data-redistribution algorithm from a 2D to a 2.5D process grid. Combined with PUMMA, this yields the 2.5D PUMMA algorithm, which can directly replace PDGEMM (Parallel Double-precision General Matrix-matrix Multiplication) and is therefore highly portable. Compared with classic 2D algorithms such as PDGEMM in ScaLAPACK (Scalable Linear Algebra PACKage), the proposed algorithm reduces the number of communications, improves data locality, and scales better. Performance experiments were carried out on a supercomputer with 300 compute nodes, each equipped with two 10-core Xeon E5-2692 v3 CPUs and interconnected by the fat-tree TH-Express 2 network. At large process counts, e.g. 4096 processes, the measured speedup over PDGEMM reaches 2.20 to 2.93. The 2.5D PUMMA algorithm is further applied to accelerate the eigenvalue decomposition of symmetric tridiagonal matrices, where it achieves speedups above 1.2. The paper analyzes the performance of the 2.5D PUMMA algorithm through extensive numerical experiments, offers practical recommendations, and outlines future work.
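The core 2.5D idea described in the abstract — split the processes into c groups, let each group compute 1/c of the multiplication from its own replicated copies of A and B, then sum the partial products in a reduction — can be illustrated with a minimal serial sketch. This is only a NumPy simulation of the arithmetic decomposition, not the paper's BLACS/MPI implementation; the function name and the assumption that the inner dimension divides evenly by c are ours.

```python
import numpy as np

def matmul_2p5d_sketch(A, B, c):
    """Serial sketch of the 2.5D decomposition: c process groups
    each multiply a 1/c slice of the inner (k) dimension; a final
    reduction sums the c partial products (MPI reduction in the paper)."""
    k = A.shape[1]
    assert k % c == 0, "sketch assumes the inner dimension divides evenly"
    kb = k // c
    # Each "group" g would hold its slabs A[:, g*kb:(g+1)*kb] and
    # B[g*kb:(g+1)*kb, :] and compute its partial product independently;
    # here the c group computations are simulated one after another.
    partials = [A[:, g * kb:(g + 1) * kb] @ B[g * kb:(g + 1) * kb, :]
                for g in range(c)]
    # Reduction step: summing the partials reproduces the full product.
    return sum(partials)
```

The sketch makes the memory/communication trade-off visible: each group stores its own copies of the matrix slabs (c-fold extra memory in aggregate), and the only inter-group communication left is the final reduction.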

Keywords: 2.5D parallel matrix multiplication algorithm; ScaLAPACK; PUMMA matrix multiplication algorithm; SUMMA algorithm; distributed parallelism

Classification: TP301 [Automation and Computer Technology — Computer System Architecture]

 
