Authors: HUANG Rongfeng; LIU Shifang; ZHAO Yonghua [1]
Affiliations: [1] Computer Network Information Center, Chinese Academy of Sciences, Beijing 100080, China; [2] University of Chinese Academy of Sciences, Beijing 100080, China
Source: Computer Science (《计算机科学》), 2023, No. 4, pp. 397-403 (7 pages)
Funding: National Key Research and Development Program of China (2017YFB0202202); Strategic Priority Research Program of the Chinese Academy of Sciences (XDC05000000).
Abstract: Batched matrix computations arise widely in scientific computing and engineering applications. With its rapid performance gains, the GPU has become an important tool for solving such problems. Eigenvalue decomposition is a two-sided decomposition and must be computed with an iterative algorithm, and the number of iterations can differ from one matrix to another. Designing batched eigenvalue decomposition algorithms on the GPU is therefore more challenging than designing batched algorithms for one-sided decompositions such as LU decomposition. This paper proposes Jacobi-based batched GPU algorithms for the eigenvalue decomposition of Hermitian matrices, with a separate design for each range of matrix sizes. For matrices that do not fit entirely in shared memory, a blocking technique raises the arithmetic intensity and thereby improves the utilization of GPU resources. The proposed algorithms run entirely on the GPU, which avoids communication between the CPU and the GPU. In the implementation, kernel fusion reduces both kernel launch overhead and global memory accesses. Experimental results on a V100 GPU show that the proposed algorithms outperform existing work. Analysis with the Roofline performance model indicates that the implementations are close to the theoretical upper bound, reaching 4.11 TFLOPS.
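To make the Jacobi idea in the abstract concrete, the following is a minimal CPU sketch in pure Python of a cyclic Jacobi eigenvalue iteration for Hermitian matrices, applied to a batch one matrix at a time. It is not the paper's GPU implementation: the function names `jacobi_eigh` and `batched_jacobi_eigh`, the tolerance, and the sweep limit are assumptions made for illustration, and there is no blocking, shared-memory use, or kernel fusion. It only illustrates two points from the abstract: each rotation is applied from both sides (a two-sided update), and the number of sweeps needed can differ between matrices in the same batch.

```python
import math


def jacobi_eigh(a, tol=1e-12, max_sweeps=50):
    """Cyclic Jacobi sketch for one Hermitian matrix (list of lists).

    Returns (eigenvalues, eigenvector matrix V, sweeps used), where the
    columns of V are the eigenvectors. Illustrative only, not optimized.
    """
    n = len(a)
    a = [[complex(x) for x in row] for row in a]  # work on a copy
    v = [[complex(i == j) for j in range(n)] for i in range(n)]
    sweeps = 0
    while sweeps < max_sweeps:
        # Frobenius norm of the off-diagonal part drives convergence.
        off = math.sqrt(sum(abs(a[i][j]) ** 2
                            for i in range(n) for j in range(n) if i != j))
        if off <= tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                beta = abs(a[p][q])
                if beta == 0.0:
                    continue
                phase = a[p][q] / beta  # e^{i*phi} of the pivot entry
                tau = (a[q][q].real - a[p][p].real) / (2.0 * beta)
                t = math.copysign(1.0, tau) / (abs(tau) + math.hypot(1.0, tau))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = t * c
                # 2x2 unitary block U = diag(1, e^{-i*phi}) * [[c, s], [-s, c]]
                upp, upq = complex(c), complex(s)
                uqp, uqq = -s * phase.conjugate(), c * phase.conjugate()
                for k in range(n):  # right update: A <- A J and V <- V J
                    for m in (a, v):
                        mp, mq = m[k][p], m[k][q]
                        m[k][p] = mp * upp + mq * uqp
                        m[k][q] = mp * upq + mq * uqq
                for k in range(n):  # left update: A <- J^H A
                    ap, aq = a[p][k], a[q][k]
                    a[p][k] = upp.conjugate() * ap + uqp.conjugate() * aq
                    a[q][k] = upq.conjugate() * ap + uqq.conjugate() * aq
        sweeps += 1
    return [a[i][i].real for i in range(n)], v, sweeps


def batched_jacobi_eigh(batch, tol=1e-12):
    """Apply the sketch above to every matrix in a batch, sequentially."""
    return [jacobi_eigh(a, tol) for a in batch]


if __name__ == "__main__":
    batch = [
        [[2, 1j], [-1j, 2]],
        [[4, 1 + 2j, 0], [1 - 2j, 3, 1j], [0, -1j, 1]],
    ]
    for eigenvalues, _, sweeps in batched_jacobi_eigh(batch):
        print(sorted(eigenvalues), "sweeps:", sweeps)
```

Because each matrix stops as soon as its own off-diagonal norm falls below the tolerance, the sweep counts returned for a batch are generally not equal, which is the load-balancing difficulty the abstract attributes to iterative two-sided decompositions.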
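Keywords: Hermitian matrix; eigenvalue decomposition; batched computing; Roofline model; performance analysis
Classification: TP301 [Automation and Computer Technology: Computer Architecture]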