A Parallel Two-Tier Blocked Jacobi SVD Algorithm on GPU  Cited by: 2


Authors: Huang Rongfeng; Zhao Yonghua[1]; Yu Tianyu; Liu Shifang (Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China)

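Affiliations: [1] Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190 [2] University of Chinese Academy of Sciences, Beijing 100049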

Source: Journal on Numerical Methods and Computer Applications, 2022, No. 4, pp. 380-399 (20 pages)

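Funding: National Key Research and Development Program of China (2017YFB0202202); Strategic Priority Research Program of the Chinese Academy of Sciences (XDC05000000)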

Abstract: Singular value decomposition (SVD) is widely used in image processing, face recognition, signal denoising, and other fields. Based on the one-sided Jacobi SVD algorithm, this paper presents a blocked Jacobi SVD algorithm for GPUs with two tiers of parallelism, inter-block and intra-block, together with an efficient implementation. At the inter-block level, the shared memory is too small to hold the matrix column blocks (panels), so the inner products between column blocks are stored in shared memory instead. The inter-block level also uses block matrix operations to improve data reuse, and data prefetching to overlap data access with computation. At the intra-block level, the inner products between column blocks are updated directly and in parallel; this replaces updating the column blocks and then recomputing their inner products with a reduction, and it increases GPU thread utilization. In addition, data that is accessed many times is kept in shared memory or registers, which reduces global-memory accesses and improves the performance of the implementation. Numerical experiments on an NVIDIA Tesla V100 GPU show that the implementation achieves a speedup of 1.8× over the cuSOLVER library and of up to 2.5× over the fastest algorithm in the MAGMA library.
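The idea of rotating the stored inner products rather than the columns themselves can be illustrated with a small sequential sketch. The code below is not the authors' GPU implementation; it is a minimal CPU-only illustration in plain Python, and the names `jacobi_svd_gram`, `tol`, and `max_sweeps` are assumptions made for this example. It forms the Gram matrix G = AᵀA once, then applies each Jacobi rotation J directly to G (G ← JᵀGJ), so no column inner product is ever recomputed by a reduction over the rows of A.

```python
import math


def jacobi_svd_gram(A, tol=1e-12, max_sweeps=30):
    """Return (singular values, V) for an m x n matrix A given as a list of rows.

    Illustrative sketch only: rotations are applied to the Gram matrix
    G = A^T A instead of to the columns of A.
    """
    m, n = len(A), len(A[0])
    # G[p][q] holds the inner product of columns p and q of A.
    G = [[sum(A[i][p] * A[i][q] for i in range(m)) for q in range(n)]
         for p in range(n)]
    # V accumulates the rotations; its columns become right singular vectors.
    V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha, beta, gamma = G[p][p], G[q][q], G[p][q]
                # Skip pairs that are already numerically orthogonal.
                if abs(gamma) <= tol * math.sqrt(max(alpha * beta, 0.0)):
                    continue
                rotated = True
                # Rotation angle that zeroes the (p, q) inner product.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.hypot(1.0, zeta))
                c = 1.0 / math.hypot(1.0, t)
                s = c * t
                # G <- G J: rotate columns p and q of G.
                for i in range(n):
                    gp, gq = G[i][p], G[i][q]
                    G[i][p] = c * gp - s * gq
                    G[i][q] = s * gp + c * gq
                # G <- J^T G: rotate rows p and q of G.
                for j in range(n):
                    gp, gq = G[p][j], G[q][j]
                    G[p][j] = c * gp - s * gq
                    G[q][j] = s * gp + c * gq
                # V <- V J: accumulate the same rotation.
                for i in range(n):
                    vp, vq = V[i][p], V[i][q]
                    V[i][p] = c * vp - s * vq
                    V[i][q] = s * vp + c * vq
        if not rotated:
            break

    # The diagonal of the converged G holds the squared singular values.
    sigma = [math.sqrt(max(G[j][j], 0.0)) for j in range(n)]
    order = sorted(range(n), key=lambda j: -sigma[j])
    sigma_sorted = [sigma[j] for j in order]
    V_sorted = [[V[i][j] for j in order] for i in range(n)]
    return sigma_sorted, V_sorted
```

Each rotation here costs O(n) work on G instead of O(m) work on the columns of A, which is the kind of saving that matters when the columns are tall. One caveat of this simplified form: working purely on AᵀA squares the condition number, so very small singular values lose accuracy. Left singular vectors, where needed, can be recovered as the columns of AV divided by the corresponding nonzero singular values.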

Keywords: singular value decomposition; block Jacobi algorithm; parallel algorithm; GPU; data prefetching

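Classification: TP332 [Automation and Computer Technology: Computer System Architecture]; TP301.6 [Automation and Computer Technology: Computer Science and Technology]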

 
