Source: Chinese Journal of Computers, 2025, Issue 1, pp. 19-34 (16 pages)
Funding: National Natural Science Foundation of China (62372462); Natural Science Foundation of Hunan Province (2023JJ40682); NUDT Youth Independent Innovation Science Fund (ZK2023-13); National University of Defense Technology High-Level Talent Support Program
Abstract: The operation and serving of distributed machine-learning models are inseparable from computing power and network support. As Moore's Law slows and computing power grows far more slowly than I/O bandwidth, near-data processing has become the natural choice of the post-Moore era: move data as little as possible and process it along its path. As the transmission path of data, a high-speed network connects multiple computing devices into a system that communicates and cooperates; in AI applications, it is the bridge between algorithms, data, and computing power. The arrival of the AI era poses a dual challenge to today's computing power: inference on one hand, and distributed training on the other. Offloading part of a distributed application's computation to the NICs or switches of a high-speed network can potentially improve application performance and let the network play a key role. For example, offloading functions such as parameter aggregation into switches or NICs can effectively reduce the heavy communication overhead incurred during model training. The P4-based programmable data plane not only makes network-protocol customization more flexible, but also enables the data plane to provide simple in-network computing services for distributed applications. However, typical P4-based architectures such as the Protocol Independent Switch Architecture (PISA) are not efficient at matrix operations. The key reason is that the Very Long Instruction Word (VLIW) compute engine in PISA handles large-scale, parallel, homogeneous computing tasks inefficiently. To address this problem, this paper proposes CAInNet, an in-network computing model that unifies general-purpose computing and networking for AI acceleration. Building on the traditional programmable data plane, the model innovatively integrates the Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD) execution modes, so that a network device can not only perform protocol-independent packet processing but also compute, during packet transmission, on the data carried by AI inference and training traffic. To validate CAInNet's in-network computing and network-programmability capabilities, we implement network visibility in the model with In-band Network Telemetry and deploy a multi-layer perceptron (MLP) for AI-based packet classification, replacing the traditional TCAM-lookup-based routing method. Experiments show that with 5k routing-table entries, the machine-learning-based classifier reaches 98.3% accuracy while saving 98.7% of storage, effectively mitigating the routing-table explosion problem. Compared with existing methods, deploying ML inference in CAInNet adds no processing latency to the programmable data plane and consumes only a moderate amount of computing resources.
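To make the abstract's key idea concrete, here is a minimal, self-contained sketch of replacing a TCAM longest-prefix-match lookup with MLP inference that maps destination-address bits to an egress port. This is not the paper's CAInNet implementation: the network shape, the hand-crafted weights, and the two-port routing policy below are illustrative assumptions only.

```python
# Toy sketch (NOT the paper's CAInNet design): a tiny MLP that classifies a
# destination IPv4 address to an egress port, standing in for a TCAM route
# lookup. All weights are hand-crafted, hypothetical values for illustration.

def ip_to_bits(ip: str, n: int = 8):
    """Return the first n bits of a dotted-quad IPv4 address as 0/1 floats."""
    value = 0
    for octet in ip.split("."):
        value = (value << 8) | int(octet)
    return [float((value >> (31 - i)) & 1) for i in range(n)]

def relu(v):
    return [max(0.0, x) for x in v]

def mlp_forward(x, w1, b1, w2, b2):
    """One hidden layer with ReLU, then a linear output layer; argmax = port."""
    h = relu([sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(w1, b1)])
    out = [sum(w * hi for w, hi in zip(row, h)) + b
           for row, b in zip(w2, b2)]
    return out.index(max(out))  # egress port with the highest score

# Hypothetical policy: port 0 if the address's first bit is 0 (e.g. 10.0.0.0/8),
# port 1 otherwise (e.g. 192.0.0.0/3).
W1, B1 = [[1.0] + [0.0] * 7], [0.0]    # single hidden unit = first address bit
W2, B2 = [[-1.0], [1.0]], [1.0, 0.0]   # output scores: [1 - h, h]

def classify(ip: str) -> int:
    return mlp_forward(ip_to_bits(ip), W1, B1, W2, B2)

print(classify("10.1.2.3"))     # prefix 00001010 -> port 0
print(classify("192.168.0.1"))  # prefix 11000000 -> port 1
```

In the paper's setting the same forward pass would run in the data plane's SIMD engine rather than in software, and the weights would be trained on the actual routing table; the storage saving reported in the abstract comes from replacing per-prefix TCAM entries with a fixed set of model parameters.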
Keywords: AI hardware acceleration; unified computing and networking; in-network computing; programmable network; packet classification; deep neural network
Classification: TP393 [Automation and Computer Technology: Computer Application Technology]