稀疏矩阵向量乘法在申威众核架构上的性能优化  被引量:13

Performance Optimization for Sparse Matrix-Vector Multiplication on Sunway Architecture

在线阅读下载全文

作  者:李亿渊 薛巍[1,2] 陈德训 王欣亮 许平 张武生 杨广文[1,2] LI Yi-Yuan;XUE Wei;CHEN De-Xun;WANG Xin-Liang;XU Ping;ZHANG Wu-Sheng;YANG Guang-Wen(Department of Computer Science and Technology,Tsinghua University,Beijing 100084;National Supercomputing Center in Wuxi,Wuxi,Jiangsu 214072)

机构地区:[1]清华大学计算机科学与技术系,北京100084 [2]国家超级计算无锡中心,江苏无锡214072

出  处:《计算机学报》2020年第6期1037-1051,共15页Chinese Journal of Computers

基  金:国家电网公司科技项目“适应于电力系统应用的高性能计算技术研究与开发”(合同号:XT71-19-022)资助.

摘  要:计算机数值模拟是现代科学和技术发展的重要触发力量.在数值模拟中,求解大规模稀疏线性方程组是非常重要的一个环节.迭代求解过程中稀疏矩阵向量乘法是耗时最长的计算核心之一,存在严重的数据局部性差、写冲突、负载不均衡等问题.因此,稀疏矩阵向量乘法已经成为了当前性能优化的难点和研究热点.本文面向国产众核处理器架构,以申威26010国产众核处理器为平台,针对稀疏矩阵向量乘法,在线程级和指令级并行层面上进行细粒度的并行算法设计和优化实现.其核心思想是,将众核架构设计精巧的矩阵分层分块技术用于矩阵存储、访问和任务调度,在保证右端向量数据复用的同时有效实现了负载均衡,避免了申威26010上因频繁缓存判断和细粒度访问导致的潜在性能问题.通过对SuiteSparse矩阵集合中的2710个算例的测试,该算法可以获得与主核上的串行算法相比11.7倍的平均加速和55倍的最高加速.Numerical simulation is an important trigger for the development of modern science and technology.Meanwhile,solving large-scale linear systems with the iterative methods is widely used in nowadays numerical simulations and is regarded as one of the most time-consuming components.During the solving process,the sparse matrix-vector multiplication(Ax=b,where A is a sparse matrix and both x and b are vectors;abbreviated as SpMV)is one of the most critical computing kernels.The inherent issues of SpMV,such as the poor data-locality,write-conflict,and load-imbalance,hinder the high performance achieved.Moreover,the design features of recent Chinese home-grown many-core processor Shenwei 26010,including cache-less and inefficient fine-grained memory access,make the implementation optimization of SpMV even more difficult.This work focuses on developing high-performance sparse matrix-vector multiplication algorithm,well exploiting the parallelism of thread-level and effectively resolving the issues mentioned on Shenwei 26010.In order to achieve high performance of SpMV on Shenwei 26010,this paper designed a many-core SpMV algorithm based on hierarchical blocking.First,the matrix is divided into several matrix bands by rows.Every matrix band is assigned to at most one row of CPE cluster of SW26010 for calculation according to greedy strategy.This task allocation scheme can ensure load-balance across different rows of CPE cluster as much as possible.Second,the CPEs in the same row share the value of the vector b through register-level communication,and only the selected leading CPE writes the result back to memory.This method solves the problem of write-conflict.Third,each matrix band is equally divided into several sub-matrices according to the number of non-zero elements.And the sub-matrices are assigned to CPEs in the same row of CPE cluster for calculation.In this way,the load-balance across CPEs is also guaranteed.Finally,the sub-matrices are further divided into several small matrix blocks due to the limited size of t

关 键 词:申威众核处理器 并行计算 矩阵向量乘法 矩阵格式 稀疏矩阵计算 

分 类 号:TP391[自动化与计算机技术—计算机应用技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象