Distributed Inference Techniques for Large Language Models in Low-Resource Clusters (cited by: 1)

Accelerating Distributed Inference of Large Language Models in Low-Resource Clusters

Authors: FENG Wenjiao (冯文佼); LI Zonghang (李宗航); YU Hongfang (虞红芳) [1] (University of Electronic Science and Technology of China, Chengdu 611731, China)

Affiliation: [1] University of Electronic Science and Technology of China, Chengdu 611731

Source: ZTE Technology Journal (《中兴通讯技术》), 2024, No. 2, pp. 43-49 (7 pages)

Abstract: This paper explores a distributed inference paradigm for large language models (LLMs) that offers stronger parallelism and better compatibility, designed specifically for environments with weak computing power and small GPU memory. To cope with the bandwidth gap between intra-host and inter-host links, an efficient All-Reduce collective communication technique based on a communication tree is designed; for clusters with small GPU memory, a fine-grained GPU memory management and scheduling technique is designed. Finally, building on these key techniques, an LLM inference software system for resource-constrained scenarios is constructed. It aims to maximize the size of the LLM that can be run on a limited number of low-resource devices, while accelerating distributed inference by optimizing the communication strategy and computation scheduling. Experiments show that, with these techniques applied, the proposed scheme reduces first-token generation latency by 34%~61%, increases token generation throughput (tokens per second) by 52%~150%, and reduces GPU memory usage by 61%.
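The abstract does not give the details of the communication-tree All-Reduce, so the following is only a minimal sketch of the general idea of a two-level (tree-shaped) All-Reduce, not the authors' implementation. It simulates devices as plain Python lists; the function name `tree_all_reduce` and the host names are hypothetical, and the sketch assumes every host has at least one device and all vectors have the same length.

```python
from typing import Dict, List


def tree_all_reduce(hosts: Dict[str, List[List[float]]]) -> Dict[str, List[List[float]]]:
    """Two-level All-Reduce (sum) over vectors grouped by host.

    `hosts` maps a host name to the list of per-device vectors on that host.
    Returns the same layout with every device holding the global sum.
    """
    # Step 1: intra-host reduce over the fast local links,
    # producing one partial sum per host.
    host_sums = {
        name: [sum(vals) for vals in zip(*devices)]
        for name, devices in hosts.items()
    }
    # Step 2: inter-host reduce over the slow links; only one
    # vector per host crosses the network.
    total = [sum(vals) for vals in zip(*host_sums.values())]
    # Step 3: broadcast the global sum back down the tree to every device.
    return {
        name: [list(total) for _ in devices]
        for name, devices in hosts.items()
    }


if __name__ == "__main__":
    cluster = {
        "host_a": [[1.0, 2.0], [3.0, 4.0]],  # two devices on one host
        "host_b": [[10.0, 20.0]],            # one device on another host
    }
    result = tree_all_reduce(cluster)
    print(result["host_a"][0])  # [14.0, 26.0]
```

The point of the layout is that the slow inter-host step handles one vector per host instead of one per device, which is the usual motivation for hierarchical collectives when intra-host and inter-host bandwidths differ.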

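Keywords: LLM; distributed inference paradigm; resource-constrained scenarios; optimized communication strategy and computation scheduling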

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
