LSTM training system based on heterogeneous hardware


Authors: HUANG Weixin; HU Weifang; CAO Xuejiao; SHI Xuanhua [1,2]

Affiliations: [1] School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China; [2] National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System (Ministry of Education), Huazhong University of Science and Technology, Wuhan 430074, Hubei, China

Source: Big Data Research, 2024, Issue 4, pp. 172-188 (17 pages)

Funding: National Science and Technology Major Project for New-Generation Artificial Intelligence (No. 2020AAA0108501); Hubei Province Major Scientific and Technological Project (JD) (No. 2023BAA024).

Abstract: In the era of big data, deep neural network models represented by LSTM can process massive data and perform well in language processing, speech recognition, and time-series prediction. However, as model complexity grows, training cost rises sharply. Existing LSTM training systems use acceleration techniques such as operator fusion and multi-stream execution, but they overlook the parallelism available inside a single training operator, which leads to low utilization of computing resources and long overall training time. This paper therefore designs TurboLSTM, an LSTM training system based on a fine-grained model partitioning method and a multi-stream parallel scheduling strategy. New underlying training operators, built on two kinds of heterogeneous hardware (NVIDIA GPUs and the domestic Ascend NPU), let tasks make reasonable use of computing resources. Compared with existing training systems, TurboLSTM shortens single-operator training time by about 23% and overall model training time by about 17% on the GPU, and shortens single-operator training time by about 15% on the NPU, while significantly improving the utilization of computing resources. This shows that the proposed acceleration scheme is efficient and generalizes well.
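The intra-operator parallelism the abstract refers to can be illustrated with a plain NumPy LSTM cell: the four gate projections of one time step share no data dependencies, so a fine-grained scheduler can in principle dispatch them on separate streams before the element-wise state update. This is a minimal sketch for illustration only, not TurboLSTM's actual operator implementation; all names here are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, h, c, W, U, b):
    """One LSTM time step with the four gate projections kept as
    separate matmuls. Each projection (input, forget, cell, output)
    is an independent task: a multi-stream system could run the four
    in parallel. Here they are computed sequentially for clarity.

    x: (batch, d) input; h, c: (batch, hdim) hidden/cell state;
    W[k]: (d, hdim), U[k]: (hdim, hdim), b[k]: (hdim,) per gate k.
    """
    # Four mutually independent projections (the parallelizable part).
    gates = {k: x @ W[k] + h @ U[k] + b[k] for k in ("i", "f", "g", "o")}
    i = sigmoid(gates["i"])          # input gate
    f = sigmoid(gates["f"])          # forget gate
    g = np.tanh(gates["g"])          # candidate cell state
    o = sigmoid(gates["o"])          # output gate
    # Element-wise update depends on all four gates (the sync point).
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
d, hdim, batch = 4, 3, 2
x = rng.standard_normal((batch, d))
h = np.zeros((batch, hdim))
c = np.zeros((batch, hdim))
W = {k: rng.standard_normal((d, hdim)) for k in "ifgo"}
U = {k: rng.standard_normal((hdim, hdim)) for k in "ifgo"}
b = {k: np.zeros(hdim) for k in "ifgo"}
h2, c2 = lstm_cell_step(x, h, c, W, U, b)
```

The dependency structure is what makes the abstract's claim plausible: the matmuls dominate the cost and are independent, while only the cheap element-wise update needs all gates, so streams idle little when the projections run concurrently.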

Keywords: LSTM; training acceleration; fine-grained parallelism; multi-stream scheduling

Classification: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]

 
