CPU+GPU异构并行的矩阵转置算法研究  Cited by: 3

Research on matrix transpose algorithm of CPU+GPU heterogeneous parallelism


Authors: XIAO Han (肖汉), LI Cai-lin (李彩林)[3], LI Qi (李琦)[1], ZHOU Qing-lei (周清雷)

Affiliations: [1] School of Information Science and Technology, Zhengzhou Normal University, Zhengzhou 450044, Henan, China; [2] School of Information Engineering, Zhengzhou University, Zhengzhou 450001, Henan, China; [3] School of Civil and Architectural Engineering, Shandong University of Technology, Zibo 255000, Shandong, China

Source: Journal of Northeast Normal University (Natural Science Edition) (《东北师大学报(自然科学版)》), 2019, No. 4, pp. 70-77 (8 pages)

Funding: National Natural Science Foundation of China (61572444, 61250007, 41601496, 41701525); Natural Science Foundation of Shandong Province (ZR2017LD002)

Abstract: Most current research on algorithm optimization targets a single hardware platform, which makes it difficult to run efficiently across different platforms. To address this, the paper presents a parallel matrix transpose algorithm based on the Open Computing Language (OpenCL) that exploits the graphics processing unit (GPU). By applying coarse-grained parallelism over matrix sub-blocks, fine-grained parallelism over matrix elements, a spatial mapping between work-items and data, and local memory optimization, the performance of the matrix transpose algorithm on the GPU computing platform is improved by a factor of 12. Experimental results show that, running on an NVIDIA GPU computing platform under the OpenCL architecture, the parallel matrix transpose algorithm achieves speedups of 12.26, 2.23 and 1.50 over the CPU-based serial algorithm, the Open Multi-Processing (OpenMP) parallel algorithm and the Compute Unified Device Architecture (CUDA) parallel algorithm, respectively. The algorithm not only achieves high performance but also provides performance portability across different computing platforms.
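The abstract names the ingredients of the method (sub-block parallelism, per-element parallelism, a work-item-to-data mapping, and local memory) but this record does not include the kernel itself. As a rough illustration only, the sketch below emulates a tiled transpose sequentially in plain Python: each tile stands in for a work-group, each element of a tile for a work-item, and a small staging buffer for local memory. The function name `transpose_tiled`, the tile size of 4, and the list-of-lists layout are all assumptions made for this example; it is not the authors' OpenCL implementation and says nothing about the reported speedups.

```python
TILE = 4  # assumed tile edge; stands in for the work-group size


def transpose_tiled(a, tile=TILE):
    """Transpose a non-empty rectangular list-of-lists matrix tile by tile."""
    rows, cols = len(a), len(a[0])
    out = [[None] * rows for _ in range(cols)]
    # Coarse-grained level: each (by, bx) tile plays the role of one work-group.
    for by in range(0, rows, tile):
        for bx in range(0, cols, tile):
            h = min(tile, rows - by)  # tile height, clipped at the matrix edge
            w = min(tile, cols - bx)  # tile width, clipped at the matrix edge
            # Stage the tile in a small buffer (the stand-in for local memory).
            local = [[a[by + ly][bx + lx] for lx in range(w)] for ly in range(h)]
            # Fine-grained level: each (ly, lx) pair plays the role of one
            # work-item. It reads the buffer transposed and fills consecutive
            # positions of one output row.
            for lx in range(w):
                for ly in range(h):
                    out[bx + lx][by + ly] = local[ly][lx]
    return out


if __name__ == "__main__":
    m = [[r * 5 + c for c in range(5)] for r in range(3)]  # 3 x 5 test matrix
    for row in transpose_tiled(m, tile=2):
        print(row)
```

On a real GPU the two loop levels would run concurrently and the staging buffer would let both the global reads and the global writes touch consecutive addresses; here the loops run one after another, so the sketch shows only the index mapping.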

Keywords: matrix transpose; graphics processing unit (GPU); Open Computing Language (OpenCL); parallel algorithm

Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
