Matrix-vector Multiplication Parallel Algorithm and Implementation for OpenCL Architecture (cited by: 2)

Authors: XIAO Han, ZHOU Qing-lei [2], YAO Peng-zi [1]

Affiliations: [1] School of Information Science and Technology, Zhengzhou Normal University, Zhengzhou 450044, China; [2] School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2019, No. 1, pp. 26-30 (5 pages)

Funding: Supported by the National Natural Science Foundation of China (61572444; 61250007)

Abstract: Matrix-vector multiplication has high time complexity, and traditional computing methods struggle to guarantee real-time performance and cross-platform portability. This paper presents a parallel matrix-vector multiplication algorithm based on the Open Computing Language (OpenCL), in which the matrix-vector multiplication is decomposed into subtasks of different granularity. According to the corresponding degree of parallelism, each work-group computes the product of a row block of the matrix with the column vector, and each work-item computes the product of one row vector within that row block with the column vector; these tasks are assigned to the compute units and the processing elements, respectively. Experimental results show that, on an NVIDIA graphics processing unit (GPU) platform under the OpenCL architecture, the parallel algorithm achieves speedups of 20.86×, 6.39× and 1.49× over a CPU-based serial algorithm, an OpenMP-based parallel algorithm and a Compute Unified Device Architecture (CUDA)-based parallel algorithm, respectively. These results verify the effectiveness and performance portability of the proposed parallel optimization method.

Keywords: matrix-vector multiplication; graphics processing unit; Open Computing Language; parallel algorithm

CLC number: TP311 [Automation and Computer Technology / Computer Software and Theory]