面向众核CPU的稠密线性求解器性能评测与优化

Dense linear solver on many-core CPUs:Characterization and optimization

作　　者：付晓苏醒董德尊[1] 钱程东 FU Xiao;SU Xing;DONG De-zun;QIAN Cheng-dong(College of Computer Science and Technology,National University of Defense Technology,Changsha 410073;Phytium Technology Co.,Ltd.,Tianjin 300459,China)

机构地区：[1]国防科技大学计算机学院,湖南长沙410073 [2]飞腾信息技术有限公司,天津300459

出　　处：《计算机工程与科学》2024年第6期984-992,共9页Computer Engineering & Science

基　　金：国家重点研发计划(2022YFB4501702);湖南省自然科学基金杰出青年基金(2021JJ10050)。

摘　　要：稠密线性求解器在高性能计算和机器学习等领域扮演着重要的角色。其典型的并行算法实现通常构建在著名的fork-join或task-based编程模型之上。尽管采用fork-join模型的主流稠密线性代数库能将大部分的计算转移到高度优化、高性能的BLAS 3例程上,由于fork-join不灵活的执行流,它们仍然未能高效地利用众核CPU的计算资源。采用task-based编程模型的开源库能实现更加灵活、负载更均衡的算法,因此能获得明显的性能提升。然而,在众核CPU平台上,尤其是对于中等矩阵规模的问题而言,它们仍然有较大的优化空间。对稠密线性求解器的性能进行了全面的测评,以定位性能瓶颈,并提出了2种优化策略,以提高程序性能。具体地,通过重叠LU分解和下三角求解的计算过程,减少同步开销线程的空等,从而提高算法的并行性;进一步通过减少冗余的矩阵打包操作,降低算法的访存开销。分别在2个主流的众核CPU平台(Intel®Xeon Gold®6252N(48核)和HiSilicon Kunpeng 920(64核))上进行了性能评估。实验结果表明,该优化的稠密线性求解器在上述两个CPU平台上,相比最佳开源实现分别取得了10.05%(Xeon)和13.63%(Kunpeng 920)的性能提升。The dense linear solver plays a vital role in high-performance computing and machine learning.Typical parallel implementations are built upon the well-known fork-join or task-based programming model.Though mainstream dense linear algebra libraries adopting the fork-join paradigm can shift most of the computation to well-tuned and high-performance BLAS 3 routines,they fail to exploit many-core CPUs efficiently due to the rigid execution stream of fork-join.While open-source implementations employing the task-based paradigm can provide more promising performance thanks to the model s malleability and better load balance,they still leave much room for optimization on many-core platforms,especially for medium-sized matrices.In this paper,a quantitative characterization of the dense linear solver is carried out to locate performance bottlenecks and a series of optimizations is proposed to deliver higher performance.Specifically,idle threads are reduced by merging LU factorization with the following lower triangular solver to improve parallelism.Moreover,duplicated matrix packing operations are reduced to lower memory overhead.Performance evaluation is conducted on two modern many-core platform,Intel®Xeon Gold®6252N(48 cores)and HiSilicon Kunpeng 920(64 cores).Evaluation results show that our optimized solver outperforms the state-of-the-art open-source implementation by a factor up to 10.05%(Xeon)and 13.63%(Kunpeng 920)on the two platforms,respectively.

关键词：稠密线性求解器 LU分解 fork-join模型 task-based模型众核CPU

分类号：TP311[自动化与计算机技术—计算机软件与理论]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

面向众核CPU的稠密线性求解器性能评测与优化

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

面向众核CPU的稠密线性求解器性能评测与优化

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索