Time Cost Model and Optimal Configuration Method for GPU Parallel Computation of Matrix Multiplication

Authors: LEI Chao; LIU Jiang [1,2]; SONG Jiawen

Affiliations: [1] Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China; [2] Chongqing School, University of Chinese Academy of Sciences, Chongqing 400714, China; [3] Research Institute of Aerospace Technology, Central South University, Changsha 410017, China

Source: Computer Science (《计算机科学》), 2024, Issue S01, pp. 810-817 (8 pages)

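Funding: National Key Research and Development Program of China (2018YFC0116704); Regional Key Project of the Science and Technology Service Network Initiative, Chinese Academy of Sciences (KFJ-STS-QYZD-2021-01-001); Central South University project (Research on Parallel Accelerated Solution of Large-Scale Sparse Linear Systems); Research on Key Technologies of Radar Data Assimilation and Objective Correction of Numerical Forecasts (E190600801).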

Abstract: Multiplying a horizontal matrix by a vertical matrix (HVM) is one of the fundamental computations in scientific computing and engineering, and it largely determines the computational efficiency of the higher-level algorithms built on it. GPU parallel computing is one of today's mainstream parallel computing approaches, and its underlying design makes GPUs well suited to large-scale matrix computation. Many studies have used GPU parallel computing frameworks to design and optimize matrix multiplication according to matrix structure, but GPU parallel algorithms and optimizations specifically targeting HVM have been lacking. In addition, the GPU kernel function configuration directly affects computational efficiency, yet research on optimal kernel function configuration remains very limited, and researchers typically set it heuristically based on the computational characteristics of the specific algorithm. Based on the GPU thread and memory models, this paper designs a parallel HVM algorithm, PHVM. Numerical experiments show that when the horizontal dimension of the left matrix is much larger than its vertical dimension, PHVM significantly outperforms the general matrix multiplication in the NVIDIA cuBLAS library. Furthermore, based on GPU hardware parameters, the paper establishes a theoretical model for optimizing the kernel function configuration with respect to PHVM runtime. Numerical experiments indicate that the model describes fairly accurately how PHVM runtime varies with the kernel function configuration (grid size and thread block size), and that the theoretically optimal configuration derived from the model agrees with the configuration that runs fastest in practice.
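To make the HVM shape concrete, the following is a minimal Python sketch of multiplying an m x n matrix by an n x p matrix when n is much larger than m. It is not the authors' PHVM algorithm, whose internals the abstract does not describe. It assumes one plausible decomposition: the shared dimension n is split into chunks, each chunk stands in for the work of one thread block, and the partial products are summed. The names `hvm_chunked` and `num_blocks` are hypothetical.

```python
def hvm_chunked(A, B, num_blocks):
    """Multiply an m x n matrix A by an n x p matrix B (n much larger than m).

    Assumption for illustration only: the shared dimension n is split into
    num_blocks contiguous chunks. Each chunk contributes a partial m x p
    product, standing in for the work of one GPU thread block, and the
    partial products are accumulated into C.
    """
    m, n, p = len(A), len(B), len(B[0])
    chunk = (n + num_blocks - 1) // num_blocks  # ceiling division
    C = [[0.0] * p for _ in range(m)]
    for b in range(num_blocks):
        lo, hi = b * chunk, min((b + 1) * chunk, n)
        for i in range(m):
            for j in range(p):
                # range(lo, hi) is empty for trailing blocks with no work
                C[i][j] += sum(A[i][k] * B[k][j] for k in range(lo, hi))
    return C


A = [[1, 2, 3, 4, 5, 6],
     [6, 5, 4, 3, 2, 1]]                              # 2 x 6 (horizontal)
B = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]  # 6 x 2 (vertical)
C = hvm_chunked(A, B, num_blocks=4)
print(C)
```

In this sketch the result does not depend on `num_blocks`; only the division of work changes. On a real GPU the analogous choice of grid size and thread block size changes the runtime, which is the quantity the paper's model sets out to predict.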

Keywords: matrix multiplication; GPU; CUDA; kernel function configuration

Classification number: TP391.9 [Automation and Computer Technology: Computer Application Technology]
