Authors: DUAN Xiaodong (段晓东); LI Jieyu (李婕妤); CHENG Weiqiang (程伟强); LI Han (李晗); WANG Ruixue (王瑞雪); WANG Haojie (王豪杰)
Affiliation: [1] China Mobile Research Institute, Beijing 100053, China
Source: Telecommunications Science (《电信科学》), 2024, No. 6, pp. 146-159 (14 pages)
Abstract: AI large models are set to lead the next decade of development in the information and communications technology (ICT) industry. The intelligent computing center network is the communication foundation that supports distributed training of AI large models, and it is one of the key factors determining the efficiency of AI clusters. The data volume and parameter count of AI large models keep growing, which poses severe challenges to intelligent computing center networks and also creates an opportunity for generational innovation in key network technologies. During AI large model training and inference, high-performance and highly secure data transmission are the two core requirements that AI workloads place on the intelligent computing center network. Efficient load balancing, congestion control, and network security protocols are the key network technologies involved. To address the challenges posed by large-scale AI workloads, global scheduling Ethernet (GSE) is proposed as a solution, and a real test environment was built to compare the performance of GSE and RoCE (remote direct memory access over converged Ethernet) networks. The test results show that GSE significantly improves job completion time (JCT) compared with the RoCE network.
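Keywords: distributed training of AI large models; global scheduling Ethernet; load balancing; congestion control; network security protocol
Classification code: TP393 [Automation and Computer Technology: Computer Application Technology]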