Optimization of MPI_Barrier based on the offloading characteristics of Tianhe-2

Authors: ZHU Qi; DAI Yi [1]; PENG Jintao; XIE Min [1]; LIANG Chongshan; LIU Peng; YANG Bo [1]; LIU Jie [1,2,3]

Affiliations: [1] College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China; [2] Hunan Key Laboratory of Digitizing Software for Frontier Equipment, National University of Defense Technology, Changsha 410073, Hunan, China; [3] National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, Hunan, China

Source: Computer Engineering & Science, 2025, Issue 3, pp. 400-411 (12 pages)

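Funding: National Natural Science Foundation of China (62272476); National Key Research and Development Program of China (2021YFBO300101); Key Program of the National Natural Science Foundation of China (U22B2005); Fund of the State Key Laboratory of Parallel and Distributed Processing (2021-KJWPDL-08).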

Abstract: Barrier is a fundamental operation in message passing interface (MPI) programs and one of the key mechanisms for ensuring correct program execution. Existing Barrier implementations have two main shortcomings: inter-node synchronization incurs substantial redundant data-path transmission overhead, and intra-node synchronization suffers from frequent cache misses. To address these limitations, this paper exploits the collective communication offload capability of TH-Express, the customized interconnect of Tianhe-2, and proposes two optimization techniques: Barrier acceleration based on the GLEX NIC, and rearrangement of the shared-memory flags. Together they reduce the overhead of inter-node synchronization and improve the efficiency of shared-memory synchronization within a node. Building on these techniques, the paper redesigns the MPI_Barrier algorithm and integrates it into the MPI communication library. The approach is evaluated at the National Supercomputing Center in Changsha using micro-benchmarks and real applications at scales of up to 7168 nodes. Experimental results show that the optimized MPI_Barrier collective operation achieves a speedup of 1.3 to 14.5 times, and that application-level evaluations under real workloads improve performance by up to 54%.
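The abstract does not describe how the shared-memory flags are rearranged. As a rough illustration of the general idea only, the sketch below shows one common way to cut cache misses in an intra-node flag barrier: giving every arrival flag its own cache line so that one participant's write does not invalidate its neighbours' flags. It is not the paper's algorithm. Threads stand in for the MPI processes of one node, the 64-byte line size is an assumption, and all names (`shm_barrier_t`, `shm_barrier_wait`, `run_barrier_demo`) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */
#define MAX_RANKS 64   /* hypothetical upper bound on ranks per node */

/* One arrival flag per rank, padded so each flag owns a full cache line. */
typedef struct {
    alignas(CACHE_LINE) atomic_int arrived;
} padded_flag_t;
static_assert(sizeof(padded_flag_t) == CACHE_LINE, "flag must fill one line");

typedef struct {
    padded_flag_t flags[MAX_RANKS];
    alignas(CACHE_LINE) atomic_int release; /* written only by rank 0 */
    int nranks;
} shm_barrier_t;

static void shm_barrier_init(shm_barrier_t *b, int nranks) {
    b->nranks = nranks;
    atomic_init(&b->release, 0);
    for (int i = 0; i < MAX_RANKS; i++) atomic_init(&b->flags[i].arrived, 0);
}

/* Epoch-based flag barrier: each rank keeps its own epoch counter. */
static void shm_barrier_wait(shm_barrier_t *b, int rank, int *epoch) {
    int e = ++(*epoch);
    if (rank == 0) {
        /* Leader gathers: wait until every other rank has posted epoch e. */
        for (int i = 1; i < b->nranks; i++)
            while (atomic_load_explicit(&b->flags[i].arrived,
                                        memory_order_acquire) != e)
                sched_yield();
        /* An inter-node phase (e.g. a NIC-offloaded barrier) would go here. */
        atomic_store_explicit(&b->release, e, memory_order_release);
    } else {
        /* Post arrival on this rank's private line, then wait for release. */
        atomic_store_explicit(&b->flags[rank].arrived, e, memory_order_release);
        while (atomic_load_explicit(&b->release, memory_order_acquire) != e)
            sched_yield();
    }
}

typedef struct {
    shm_barrier_t *bar;
    atomic_int *counter, *errors;
    int rank, nranks, rounds;
} worker_arg_t;

static void *worker(void *p) {
    worker_arg_t *a = p;
    int epoch = 0;
    for (int r = 1; r <= a->rounds; r++) {
        atomic_fetch_add(a->counter, 1);
        shm_barrier_wait(a->bar, a->rank, &epoch);
        /* After the barrier, every rank must have contributed to round r. */
        if (atomic_load(a->counter) != r * a->nranks)
            atomic_fetch_add(a->errors, 1);
        /* Second barrier keeps fast ranks from starting round r+1 early. */
        shm_barrier_wait(a->bar, a->rank, &epoch);
    }
    return NULL;
}

/* Runs the demo; returns the number of violations seen, or -1 on bad input. */
int run_barrier_demo(int nranks, int rounds) {
    static shm_barrier_t bar; /* static storage honours the alignment */
    atomic_int counter, errors;
    pthread_t tid[MAX_RANKS];
    worker_arg_t args[MAX_RANKS];
    if (nranks < 1 || nranks > MAX_RANKS) return -1;
    shm_barrier_init(&bar, nranks);
    atomic_init(&counter, 0);
    atomic_init(&errors, 0);
    for (int i = 0; i < nranks; i++) {
        args[i] = (worker_arg_t){&bar, &counter, &errors, i, nranks, rounds};
        if (pthread_create(&tid[i], NULL, worker, &args[i]) != 0) return -1;
    }
    for (int i = 0; i < nranks; i++) pthread_join(tid[i], NULL);
    return atomic_load(&errors);
}
```

Calling `run_barrier_demo(4, 100)` from a `main` function runs four threads through 100 rounds and returns the number of rounds in which a thread passed the barrier before all others had arrived. Packing the flags into one contiguous `int` array would be functionally equivalent but would place many flags on the same line, which is the kind of layout this padding is meant to avoid.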

Keywords: MPI; Barrier; large-scale parallel applications; NIC collective communication offload

Classification: TP301 [Automation and Computer Technology: Computer System Architecture]

 
