Authors: Huang Rongfeng; Zhao Yonghua; Yu Tianyu; Liu Shifang
Affiliations: [1] Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; [2] University of Chinese Academy of Sciences, Beijing 100049, China
Source: Journal on Numerical Methods and Computer Applications (《数值计算与计算机应用》), 2022, No. 4, pp. 380-399 (20 pages)
Funding: National Key R&D Program of China (2017YFB0202202); Strategic Priority Research Program of the Chinese Academy of Sciences (XDC05000000)
Abstract: SVD (singular value decomposition) is widely used in image processing, face recognition, signal denoising, and other fields. This paper presents a two-tier parallel blocked Jacobi SVD algorithm for GPUs, based on the one-sided Jacobi SVD algorithm, together with an efficient implementation. The two tiers are an inter-block parallel level and an intra-block parallel level. At the inter-block level, the problem that shared memory is too small to hold the matrix panels is overcome by instead storing the inner products of the matrix panels in shared memory. In addition, the matrix computation uses a block-operation technique to improve data reuse and a data-prefetching technique to overlap data loading with computation. At the intra-block level, the reduction otherwise needed to recompute the inner products of matrix columns after each column update is avoided by updating those inner products directly in parallel, which increases GPU thread utilization. Furthermore, by keeping data that is reused many times in shared memory or register files, the intra-block iteration reduces global-memory accesses and thereby improves performance. Numerical experiments on an NVIDIA Tesla V100 GPU show that this implementation is 1.8× faster than the cuSOLVER library and up to 2.5× faster than the fastest routine in the MAGMA library.
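The abstract builds on the classic one-sided Jacobi SVD kernel: pairs of matrix columns are rotated until all pairs are numerically orthogonal, after which the singular values are the column norms, and the inner products α, β, γ of each column pair drive the rotation angle (these are exactly the quantities the paper keeps in shared memory and updates directly). As background, here is a minimal serial Python sketch of that textbook kernel; the function name, tolerances, and sweep limit are illustrative assumptions, and none of the paper's blocking, prefetching, or GPU parallelization is reproduced.

```python
import math

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """Serial one-sided Jacobi SVD sketch (no blocking, no GPU).

    A is a list of m rows of n numbers, assumed to have full column
    rank. Columns are rotated in place until every pair is numerically
    orthogonal; the singular values are then the column norms.
    """
    m, n = len(A), len(A[0])
    for _ in range(max_sweeps):
        off = 0.0  # largest normalized off-diagonal inner product this sweep
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Inner products: alpha = a_p.a_p, beta = a_q.a_q, gamma = a_p.a_q.
                # (The paper updates these directly instead of recomputing them.)
                alpha = sum(A[i][p] * A[i][p] for i in range(m))
                beta = sum(A[i][q] * A[i][q] for i in range(m))
                gamma = sum(A[i][p] * A[i][q] for i in range(m))
                off = max(off, abs(gamma) / math.sqrt(alpha * beta))
                if abs(gamma) < tol:
                    continue
                # Jacobi rotation chosen to zero the (p, q) inner product.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    ap, aq = A[i][p], A[i][q]
                    A[i][p] = c * ap - s * aq
                    A[i][q] = s * ap + c * aq
        if off < tol:
            break
    # Singular values are the Euclidean norms of the rotated columns.
    sigmas = [math.sqrt(sum(A[i][j] ** 2 for i in range(m))) for j in range(n)]
    return sorted(sigmas, reverse=True)
```

Because each (p, q) rotation touches only two columns and their inner products, independent column pairs can be processed concurrently, which is the property the blocked inter-/intra-block scheme described above exploits on the GPU.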