面向异构众核架构的块Gauss-Seidel/Jacobi预条件算法  

A Block Gauss-Seidel/Jacobi Preconditioner for Heterogeneous Many-Core Architecture

在线阅读下载全文

作  者:吴立垒 陈荣亮[2] 罗力[2] 闫争争 廖子菊 迟利华 刘杰[1] WU Li-Lei;CHEN Rong-Liang;LUO Li;YAN Zheng-Zheng;LIAO Zi-Ju;CHI Li-Hua;LIU Jie(Science and Technology on Parallel and Distributed Processing Laboratory,National University of Defense Technology,Changsha 410073;Shenzhen Institutes of Advanced Technology,Chinese Academy of Sciences,Shenzhen,Guangdong 518055;Institute of Advanced Science and Technology,Hunan Institute of Traffic Engineering,Xiangyin,Hunan 414600)

机构地区:[1]国防科技大学并行与分布处理国家重点实验室,长沙410073 [2]中国科学院深圳先进技术研究院,广东深圳518055 [3]湖南交通工程学院高科技研究院,湖南湘阴414600

出  处:《计算机学报》2019年第11期2447-2460,共14页Chinese Journal of Computers

基  金:国家“九七三”重点基础研究发展规划项目基金(2017YFB0202104和2016YFB0200401);国家自然科学基金(91530324,91430218,11401580,11701547,11602282和61531166003);深圳市基础研究学科布局项目基金(JCYJ20170307165328836,JCYJ20160331193229720和JSGG20170824154458183)资助~~

摘  要:Gauss-Seidel算法作为线性方程组的求解器,在并行计算领域具有广泛应用,而面向异构众核架构开发其细粒度并行性一直是具有挑战性的问题.针对非结构网格问题,基于代数分块并行思路提出了面向异构众核架构的块Gauss-Seidel/Jacobi算法,将其作为区域分解算法的子区域求解器.面向神威太湖之光超级计算机的异构众核架构,设计并实现了该算法.为充分利用神威太湖之光国产SW26010芯片中每个CPE拥有的高速LDM(Local Data Memory),缓解通信瓶颈,设计了多行块通信打包、计算与通信重叠性能优化策略和丢弃非关键元素的低通信复杂性数值优化方法.数值实验结果显示,相较于串行Gauss-Seidel算法,优化后的块Gauss-Seidel/Jacobi算法预处理过程加速比最高可达到4.16倍.以1040核的测试数据为基准,在处理器核数达到33280时,块Gauss-Seidel/Jacobi预条件算法的并行效率达到61%.The Gauss-Seidel algorithm is widely used in the field of parallel computing as an iterative solver for linear system.However,it is challenging to exploit its fine-level parallelism on the emerging heterogeneous many-core architectures.For unstructured mesh problems,the missing of geometric information in matrices of unstructured mesh problems made the classical geometry-based parallel strategies fail.Instead,a block Gauss-Seidel/Jacobi algorithm for heterogeneous many-core architectures based on the algebra-based parallel strategy is proposed,which is used as the subdomain solver(preconditioner)in the DDM(Domain Decomposition Method).To balance between the parallel and numerical efficiencies,we combined the inherent high parallelism of Jacobi method and the good numerical converge rate of Gauss-Seidel method together to form the new algorithm.The block Gauss-Seidel/Jacobi algorithm can be regarded as small Gauss-Seidel iterations on the intra-thread level and a global Jacobi iteration on the inter-thread level.The proposed algorithm presents scalable parallelism with minimum inter-thread communication.Moreover,the unknowns were ordered in a node-by-node manner,meaning that all the N unknowns belongs to a mesh node are coupled to form a N×N block.Such ordering of unknowns does not affect the numerical performance but improves the convergence and scalability of our solver.For the problem solving the incompressible Navier-Stokes equations,the subdomain solver takes a 4×4 node block as a minimum unit.The Sunway TaihuLight supercomputer,consisted of 40960 homegrown SW26010 many-core processors with totally over 10 million cores,is exactly based on heterogeneous many-core architecture.In this paper,the block Gauss-Seidel/Jacobi algorithm was implemented and optimized based on SW26010 many-core architecture.It is known that communication is the bottleneck of heterogeneous many-core architecture.Therefore,a series of low-communication-complexity optimization strategies were designed to achieve a better numerical perfo

关 键 词:非结构网格 异构众核架构 区域分解算法 块Gauss-Seidel/Jacobi算法 神威太湖之光 

分 类 号:TP301[自动化与计算机技术—计算机系统结构]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象