Accelerating Batched Matrix Multiplication for Variable Small Sizes Based on TVM and Applications

Authors: DAI Hanwen; CHEN Changbo [1,2]

Affiliations: [1] Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China; [2] Chongqing School, University of Chinese Academy of Sciences, Chongqing 400714, China

Source: Computer Science (《计算机科学》), 2025, No. 5, pp. 25-40 (16 pages)

Funding: National Key R&D Program of China (2023YFA1009402, 2020YFA0712300); Young Top-notch Talent Project of the Chongqing Talents Program (2021000263); Chongqing Academician-led Science and Technology Innovation Guidance Special Project (cstc2021yszx-jcyjX0004, 2022YSZX-JCX0011CSTB, CSTB2023YSZX-JCX0008)

Abstract: Many practical applications require efficiently computing a large number of products of small matrices of varying dimensions. For instance, graph classification based on graph neural networks multiplies many adjacency matrices with node feature matrices. Existing methods cannot efficiently compute such variable-size batched small-matrix multiplications across different hardware platforms. To address this, this paper proposes BVSM, an efficient cross-platform algorithm built on the deep learning compiler TVM. BVSM enables TVM to perform variable-size batched small-matrix multiplication efficiently by using optimization templates tailored to small matrices, tensorized batching, and grouped padding. Experiments on real graph classification datasets show that, on CPU, BVSM achieves on average more than a twofold speedup over auto-scheduled and auto-tuned TVM (AnsorTVM), reaches 95% of the average performance of Intel MKL's variable-size batched matrix multiplication, and is up to 1.27 times faster than it. On GPU, BVSM achieves average speedups of 62.05x over AnsorTVM, 28.82x over cuBLAS, and 6.59x over MAGMA's variable-size batched matrix multiplication.
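The paper's templates and schedules are not reproduced in this abstract, but the kind of kernel they target can be sketched. The following is a minimal, hypothetical illustration of a fixed-size batched small-matrix multiplication written with TVM's tensor-expression (te) API; the function name build_batched_matmul, the shapes, and the schedule choices are ours, not the paper's.

```python
# A minimal sketch, assuming TVM's tensor-expression (te) API.
# All names and schedule choices are illustrative, not BVSM's templates.
import numpy as np
import tvm
from tvm import te

def build_batched_matmul(batch, m, k, n, target="llvm"):
    """Compile one fixed-size batched matmul: C[b] = A[b] @ B[b]."""
    A = te.placeholder((batch, m, k), name="A", dtype="float32")
    B = te.placeholder((batch, k, n), name="B", dtype="float32")
    r = te.reduce_axis((0, k), name="r")
    C = te.compute(
        (batch, m, n),
        lambda b, i, j: te.sum(A[b, i, r] * B[b, r, j], axis=r),
        name="C",
    )
    s = te.create_schedule(C.op)
    b_ax, i_ax, j_ax = C.op.axis
    # Move the reduction outside the innermost spatial loop so the last
    # dimension can be vectorized; run independent batches in parallel.
    s[C].reorder(b_ax, i_ax, r, j_ax)
    s[C].vectorize(j_ax)
    s[C].parallel(b_ax)
    return tvm.build(s, [A, B, C], target=target)

# Usage: 64 products of 8x8 matrices in a single kernel call.
f = build_batched_matmul(64, 8, 8, 8)
dev = tvm.cpu()
a = tvm.nd.array(np.random.rand(64, 8, 8).astype("float32"), dev)
b = tvm.nd.array(np.random.rand(64, 8, 8).astype("float32"), dev)
c = tvm.nd.array(np.zeros((64, 8, 8), dtype="float32"), dev)
f(a, b, c)
```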
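A kernel like the one above handles only one shape. The grouped-padding idea named in the abstract can be illustrated independently of TVM: sort the variable-size operands by shape, zero-pad each group up to its largest shape, and run one fixed-size batched multiply per group. The NumPy sketch below is our own reconstruction of that idea, not the paper's implementation; the name grouped_padded_matmul and the group size are hypothetical. Zero padding is safe because padded rows and columns contribute nothing to the product.

```python
# Hypothetical reconstruction of grouped padding; names are ours.
import numpy as np

def grouped_padded_matmul(As, Bs, group_size=32):
    """Compute As[i] @ Bs[i] for matrices of varying small shapes by
    grouping similar shapes, zero-padding each group to its maximum
    dimensions, and issuing one batched matmul per group."""
    order = sorted(range(len(As)), key=lambda i: As[i].shape + Bs[i].shape)
    results = [None] * len(As)
    for start in range(0, len(order), group_size):
        idx = order[start:start + group_size]
        m = max(As[i].shape[0] for i in idx)
        k = max(As[i].shape[1] for i in idx)
        n = max(Bs[i].shape[1] for i in idx)
        A_pad = np.zeros((len(idx), m, k), dtype=np.float32)
        B_pad = np.zeros((len(idx), k, n), dtype=np.float32)
        for t, i in enumerate(idx):
            A_pad[t, :As[i].shape[0], :As[i].shape[1]] = As[i]
            B_pad[t, :Bs[i].shape[0], :Bs[i].shape[1]] = Bs[i]
        C_pad = np.matmul(A_pad, B_pad)  # one fixed-size batched matmul
        for t, i in enumerate(idx):
            # Slice away the padding to recover each true product.
            results[i] = C_pad[t, :As[i].shape[0], :Bs[i].shape[1]]
    return results
```

Sorting by shape before grouping keeps the padding overhead within each group small, which is the trade-off grouped padding makes to replace many irregular multiplications with a few uniform batches.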

Keywords: TVM; batched matrix multiplication; variable-size matrix multiplication

CLC Number: TP302 [Automation and Computer Technology: Computer System Architecture]

 
