Direct xPU: A Novel Distributed Heterogeneous Computing Architecture Optimized for Inter-node Communication


Authors: Li Rengang; Wang Yanwei [2]; Hao Rui; Xiao Linge; Yang Le; Yang Guangwen; Kan Hongwei [3]

Affiliations: [1] Department of Computer Science and Technology, Tsinghua University, Beijing 100084; [2] Inspur (Beijing) Electronic Information Industry Co., Ltd., Beijing 100085; [3] Guangdong Inspur Intelligent Computing Technology Co., Ltd., Guangzhou 510623

Source: Journal of Computer Research and Development (计算机研究与发展), 2024, Issue 6, pp. 1388-1400 (13 pages)

Funding: Key-Area Research and Development Program of Guangdong Province (2021B0101400001).

Abstract: The explosive growth of large-scale artificial intelligence model applications has made it difficult to deploy such applications at scale on a single node or a single type of computing engine. Distributed heterogeneous computing has therefore become the mainstream choice, and inter-node communication has become one of the main bottlenecks in large-model training and inference. The inter-node communication solutions of current computing architectures, dominated by leading GPU and FPGA chip manufacturers, still have shortcomings. On the one hand, in pursuit of ultimate inter-node communication performance, some architectures adopt point-to-point transmission schemes whose protocols are simple but poorly scalable. On the other hand, although traditional heterogeneous computing engines (such as GPUs) are independent of the CPU with respect to compute resources such as memory and compute pipelines, they lack dedicated network communication devices with respect to communication resources, and must rely entirely or partially on the CPU, over physical links such as PCIe, to handle communication between the heterogeneous computing engine and the shared network communication device. The Direct xPU distributed heterogeneous computing architecture implemented in this work gives each heterogeneous computing engine independent, dedicated devices for both compute and communication resources, achieving zero-copy data transfer and further eliminating the energy consumption and latency incurred by handling cross-chip data movement during inter-node communication. Evaluation results show that Direct xPU achieves communication latency comparable to architectures that pursue ultimate inter-node communication performance, with bandwidth approaching the upper limit of the physical link.
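To make the zero-copy idea in the abstract concrete, the following C sketch contrasts the conventional CPU-staged path with a zero-copy path in which device memory is registered directly with the NIC, in the style of GPUDirect RDMA. This is an illustration only, not the paper's implementation: it assumes CUDA, libibverbs, and a platform with GPUDirect RDMA support (e.g. nvidia-peermem), and it omits queue-pair creation and connection management.

/* Illustrative sketch (not the paper's code): CPU-staged vs. zero-copy
 * registration of a GPU buffer for RDMA transmission. Assumes CUDA,
 * libibverbs, and GPUDirect RDMA support; connection setup omitted. */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stdlib.h>

/* Conventional path: copy the GPU buffer into a host staging buffer over
 * PCIe, then register only the host buffer with the NIC. The CPU and an
 * extra PCIe crossing sit on the data path; this is the overhead the
 * abstract refers to. */
static struct ibv_mr *staged_register(struct ibv_pd *pd, const void *gpu_buf,
                                      size_t len, void **host_staging)
{
    *host_staging = malloc(len);
    if (*host_staging == NULL)
        return NULL;
    cudaMemcpy(*host_staging, gpu_buf, len, cudaMemcpyDeviceToHost);
    return ibv_reg_mr(pd, *host_staging, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}

/* Zero-copy path: register the GPU buffer itself with the NIC, so the NIC
 * DMAs straight from device memory and no host staging copy is needed;
 * this is the behavior Direct xPU-style designs aim for, shown here via
 * GPUDirect RDMA as a stand-in. */
static struct ibv_mr *zero_copy_register(struct ibv_pd *pd, void *gpu_buf,
                                         size_t len)
{
    return ibv_reg_mr(pd, gpu_buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}

Either call yields an ibv_mr whose lkey can be placed in an ibv_sge and posted with ibv_post_send; the only difference is whether the source address points at a host staging buffer or directly at device memory.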

Keywords: inter-node communication; FPGA; GPU; RDMA; zero-copy

Classification: TP393 [Automation and Computer Technology - Computer Application Technology]

 
