Performance Optimization Techniques of Irregular-Shaped Matrix Multiplication on SW26010P

Original Chinese title: 面向SW26010P的异形矩阵乘法众核并行优化技术研究

Authors: HU Yi (胡怡); CHEN Daokun (陈道琨); YANG Chao (杨超)

Affiliations: [1] School of Mathematical Sciences, Peking University, Beijing 100871, China; [2] Research Center of Advanced Computing, Changsha Institute for Computing and Digital Economy, Peking University, Changsha 410205, China; [3] Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China

Source: Computer Engineering and Applications (《计算机工程与应用》), 2025, No. 6, pp. 150-163 (14 pages)

Abstract: Matrix multiplication is widely used in scientific and engineering computing and is a key optimization target in basic linear algebra libraries. With the rapid development of fields such as artificial neural networks and computational fluid dynamics, irregular-shaped matrix multiplication is quickly gaining attention. This paper studies many-core parallel optimization techniques for irregular-shaped matrix multiplication on SW26010P, the domestic many-core processor used in the new-generation Sunway supercomputer. Specifically, based on the hardware characteristics of SW26010P and the data layout of irregular-shaped matrices, a parallel algorithm with diversified task partition mapping is designed to improve direct memory access (DMA) bandwidth utilization. Based on the hardware pipelines and the vectorized memory access and computation instructions of SW26010P, the computation types involved are abstracted and optimized in low-level assembly to improve computational efficiency. A data-sharing strategy under the remote memory access (RMA) point-to-point mechanism is proposed to reduce memory access and data transfer overhead, and a nested double buffering technique is proposed to further improve performance. In addition, to address the problem of adapting blocking parameters when different kinds of irregular-shaped matrix multiplication are implemented in parallel, experiments on SW26010P are analyzed to determine the optimal blocking parameters for parallelizing each function. Experimental results show that the optimized irregular-shaped matrix multiplication reaches up to 93% of the performance upper bound predicted by the roofline model. Compared with a conventional large-scale matrix multiplication algorithm, it achieves an average speedup of 5.43 times and a maximum speedup of 51.5 times.
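The roofline bound mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model or code: the function names, the one-pass memory traffic assumption (read A and B once, write C once), and the peak and bandwidth figures are all hypothetical placeholders rather than SW26010P specifications. It only shows why a tall-and-skinny product tends to be bandwidth-bound while a large square product is compute-bound.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=8):
    """FLOPs per byte for C = A @ B with A (m x k) and B (k x n).

    Assumption: A and B are each read once and C is written once,
    i.e. ideal reuse in on-chip memory.
    """
    flops = 2.0 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def roofline_bound(intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOPS under the roofline model."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Hypothetical machine parameters for illustration only,
# NOT the SW26010P's actual peak or memory bandwidth.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 50.0

if __name__ == "__main__":
    shapes = {"square": (4096, 4096, 4096), "tall-skinny": (65536, 16, 16)}
    for name, (m, n, k) in shapes.items():
        ai = gemm_arithmetic_intensity(m, n, k)
        bound = roofline_bound(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
        print(f"{name}: intensity={ai:.2f} FLOP/byte, bound={bound:.1f} GFLOPS")
```

Under these placeholder numbers the square case hits the compute roof, while the tall-and-skinny case is capped by memory bandwidth, which is the regime where techniques that raise DMA bandwidth utilization and cut data movement matter most.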

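Keywords: irregular-shaped matrix multiplication; SW26010P many-core processor; diversified task partition mapping; RMA point-to-point mechanism; nested double buffering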

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
