Authors: LI Qing (李庆); JIA Haipeng (贾海鹏)[2]; ZHANG Yunquan (张云泉)[2]; ZHANG Sijia (张思佳)
Affiliations: [1] School of Information Engineering, Dalian Ocean University, Dalian, Liaoning 116023, China; [2] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
Source: Computer Science (《计算机科学》), 2025, No. 4, pp. 291-300 (10 pages)
Funding: National Key R&D Program of China (2023YFB3001701); National Natural Science Foundation of China (62372432).
Abstract: GEMV (general matrix-vector multiplication) is a core routine of the BLAS (Basic Linear Algebra Subprograms) library and is widely used in computer science, engineering computation, and mathematical computation. As the domestic Hygon DCU continues to be upgraded through successive versions, it has gained a certain competitive edge relative to traditional GPU vendors. Meanwhile, as GEMV is applied in more and more fields, its input characteristics are becoming increasingly diverse. Under these conditions, no single optimization method can deliver high GEMV performance on a GPU computing platform for all inputs. Building on traditional optimization techniques such as memory access optimization, instruction reordering, parallel reduction, shared memory, and thread layout, this paper therefore proposes an input-aware, performance-adaptive optimization method that automatically adjusts the implementation of the compute kernel according to the size and shape of the input matrix in order to achieve the best performance, significantly improving GEMV performance on the Hygon DCU. Experimental results show that, on the Hygon DCU Z100SM, the overall performance of the input-aware GEMV algorithm is clearly better than that of the corresponding algorithms in the RocBLAS library; across different matrix input sizes, the maximum performance reaches 3.0203 times that of the corresponding RocBLAS algorithm.
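The idea of input-aware kernel selection described in the abstract can be illustrated with a minimal host-side sketch. This is not the paper's implementation: the function names, the two strategies, and the `4 * m` shape threshold are all hypothetical, chosen only to show how a dispatcher might pick a different GEMV strategy depending on the shape of the input matrix. The sketch is pure Python and runs on the CPU; on a real device each strategy would be a separate GPU kernel.

```python
from typing import Callable, List

Matrix = List[List[float]]
Vector = List[float]


def gemv_row_wise(a: Matrix, x: Vector) -> Vector:
    """One full dot product per row (stands in for a row-per-thread kernel)."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def gemv_split_reduce(a: Matrix, x: Vector, chunk: int = 4) -> Vector:
    """Split each row into chunks, form partial sums, then reduce them.

    Stands in for a kernel in which several threads cooperate on one row
    and combine their partial results with a parallel reduction.
    """
    y = []
    for row in a:
        partials = [
            sum(row[k] * x[k] for k in range(start, min(start + chunk, len(row))))
            for start in range(0, len(row), chunk)
        ]
        y.append(sum(partials))
    return y


def select_kernel(m: int, n: int) -> Callable[[Matrix, Vector], Vector]:
    """Pick a strategy from the matrix shape (m rows, n columns).

    Hypothetical rule: short-and-wide matrices have few rows to parallelize
    over, so they use the cooperative reduction; the threshold is an
    assumption, not a value taken from the paper.
    """
    if n >= 4 * m:
        return gemv_split_reduce
    return gemv_row_wise


def gemv(a: Matrix, x: Vector) -> Vector:
    """Compute y = A x, dispatching on the shape of A."""
    m = len(a)
    n = len(a[0]) if a else 0
    if len(x) != n:
        raise ValueError("length of x must equal the number of columns of A")
    return select_kernel(m, n)(a, x)
```

In practice such thresholds would be tuned per device by benchmarking each candidate kernel over a range of shapes, which is the kind of adaptive tuning the keywords refer to.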
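Keywords: general matrix-vector multiplication; DCU; Basic Linear Algebra Subprograms library; adaptive tuning; performance optimization
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]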