Parallel Structured Sparse Triangular Solver for GPU Platform (面向GPU平台的并行结构化稀疏三角方程组求解器)

Cited by: 1

Authors: CHEN Dao-Kun (陈道琨); YANG Chao (杨超); LIU Fang-Fang (刘芳芳) [1,2]; MA Wen-Jing (马文静)

Affiliations: [1] Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; [2] University of Chinese Academy of Sciences, Beijing 100049, China; [3] School of Mathematical Sciences, Peking University, Beijing 100871, China

Source: Journal of Software (《软件学报》), 2023, No. 11, pp. 4941-4951 (11 pages)

Funding: National Key Research and Development Program of China, Key Special Project on High-Performance Computing (2020YFB0204601).

Abstract: The sparse triangular solve (SpTRSV) is a key operation in preconditioners. Structured SpTRSV problems are common in scientific computing programs that solve systems of partial differential equations iteratively, and they are often a performance bottleneck in those programs. On GPU platforms, commercial GPU math libraries, represented by cuSPARSE, parallelize SpTRSV with the level-scheduling method. That method requires time-consuming preprocessing and leaves many GPU threads idle when applied to structured SpTRSV problems. This paper proposes a parallel algorithm tailored to structured SpTRSV problems. The algorithm partitions tasks according to the special distribution pattern of non-zero elements in structured SpTRSV problems, which avoids preprocessing and analysis of the non-zero structure of the input. It also improves the element-wise processing strategy of existing level-scheduling methods, which effectively alleviates GPU thread idling and hides part of the memory access latency of the matrix's non-zero elements. Based on the characteristics of its task partitioning, the algorithm adopts a state variable compression technique that significantly improves the cache hit rate of state variable operations. In addition, the implementation is comprehensively optimized using GPU hardware features such as predicated execution. Measured on an NVIDIA V100 GPU, the proposed algorithm achieves an average speedup of 2.71× over cuSPARSE and a peak effective memory-access bandwidth of 225.2 GB/s. The improved element-wise processing strategy, combined with a series of GPU-specific tuning measures, yields a marked improvement, raising the algorithm's effective memory-access bandwidth by about 115%.

Keywords: sparse triangular solve (SpTRSV); stencil computation; structured grid; GPU; heterogeneous parallel algorithm

Classification: TP30 [Automation and Computer Technology: Computer System Architecture]
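
Illustration: the abstract contrasts generic level-scheduling, which needs a preprocessing pass over the non-zero structure, with exploiting the regular non-zero pattern of a structured problem. The Python sketch below is not the paper's algorithm. It assumes a hypothetical lower-triangular matrix from a 7-point stencil on an nx × ny × nz grid, with illustrative coefficient values, and shows that the levels found by dependency analysis coincide with i + j + k computed directly from the grid coordinates. All function names are made up for this example.

```python
from itertools import product

def lower_stencil_rows(nx, ny, nz):
    """Rows of a lower-triangular matrix from a 7-point stencil (assumed example).

    Each row is (diag, [(col, value), ...]); off-diagonals exist only for the
    lower neighbours (i-1,j,k), (i,j-1,k), (i,j,k-1). Values are illustrative.
    """
    idx = lambda i, j, k: (k * ny + j) * nx + i
    rows = []
    for k, j, i in product(range(nz), range(ny), range(nx)):
        off = []
        if k > 0:
            off.append((idx(i, j, k - 1), -1.0))
        if j > 0:
            off.append((idx(i, j - 1, k), -1.0))
        if i > 0:
            off.append((idx(i - 1, j, k), -1.0))
        rows.append((6.0, off))
    return rows

def levels_by_analysis(rows):
    """Generic preprocessing: level = 1 + max level over the row's dependencies."""
    level = [0] * len(rows)
    for r, (_, off) in enumerate(rows):
        level[r] = 1 + max((level[c] for c, _ in off), default=-1)
    return level

def levels_by_structure(nx, ny, nz):
    """Structured shortcut: the level of grid point (i, j, k) is i + j + k."""
    return [i + j + k for k, j, i in product(range(nz), range(ny), range(nx))]

def solve_by_levels(rows, b, level):
    """Forward substitution, one level at a time.

    Rows in the same level are mutually independent, so a GPU could assign
    one thread per row. Returns the solution and the number of rows per level.
    """
    buckets = {}
    for r, l in enumerate(level):
        buckets.setdefault(l, []).append(r)
    x = [0.0] * len(rows)
    for l in sorted(buckets):
        for r in buckets[l]:
            diag, off = rows[r]
            x[r] = (b[r] - sum(v * x[c] for c, v in off)) / diag
    return x, [len(buckets[l]) for l in sorted(buckets)]
```

Calling `solve_by_levels(rows, b, levels_by_structure(nx, ny, nz))` skips the analysis pass entirely. The returned level sizes are small for the first and last levels and largest in the middle, which gives an intuition for why a fixed-size pool of GPU threads is underused when each level is processed in turn.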
