DPU empowered intelligent congestion control mechanism for the intelligent computing center network

Authors: CHEN Jinqian; GUO Shaoyong[1]; LIU Chang[1]; QI Feng[1]; QIU Xuesong[1]

Affiliation: [1] State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China

Source: Journal on Communications, 2025, No. 2, pp. 1-17 (17 pages)

Funding: National Natural Science Foundation of China (No.62322103); Beijing Natural Science Foundation (No.4232009); Fundamental Research Funds for the Central Universities (No.2023ZCTH11).

Abstract: To address frequent network congestion caused by high-frequency interactions between clusters in intelligent computing centers, which makes the real-time performance of intelligent services difficult to guarantee, a congestion control model driven by a deep reinforcement learning algorithm was constructed with the data processing unit (DPU) as the core carrier. The model was compressed by combining pruning and quantization, and an efficient gradient-boosted decision tree was then generated from it through knowledge distillation, allowing rate-adjustment actions to be matched precisely to the real-time network state. Simulation results show that the proposed mechanism outperforms existing methods in both generalization capability and control effectiveness. Across multiple stress-test scenarios, it increases effective network throughput and the Jain fairness index by more than 10.8% and 8.9%, respectively, and reduces P99 end-to-end latency and packet loss rate by more than 17.31% and 11.47%, respectively. It also reduces the completion time of data flow transfer tasks in parallel computing scenarios by more than 11.23% and responds quickly to sudden changes in network state.

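Keywords: congestion control; multi-agent deep reinforcement learning; intelligent computing center network; remote direct memory access (RDMA) network; data processing unit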

Classification code: TP393 [Automation and Computer Technology - Computer Application Technology]

 
