FP8 Quantization Simulation and Inference Memory Optimization Based on MLIR

Authors: XU Jinlong, GUI Zhonghua [2], LI Jia'nan, LI Yingying, HAN Lin

Affiliations: [1] National Supercomputing Center in Zhengzhou, Zhengzhou University, Zhengzhou 450001, China; [2] School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China; [3] Fourth School, Information Engineering University, Zhengzhou 450001, China

Source: Computer Science, 2024, No. 9, pp. 112-120 (9 pages)

Funding: Major Science and Technology Special Project of Henan Province, 2022 (221100210600).

Abstract: With the rapid development of object detection models and large language models, network models are becoming increasingly large. To deploy models on edge-side hardware, model quantization is commonly used to compress them. Existing quantization strategies are mainly implemented with types such as FP16, BF16, and INT8. Among these, 8-bit data types bring the largest reductions in inference memory usage and deployment cost, but the INT8 type depends on specific calibration algorithms and handles models with large dynamic ranges and many outliers poorly. The FP8 type fits the data distributions of neural networks better and offers multiple formats, allowing flexible trade-offs between representable range and precision. However, the current MLIR system lacks support for FP8 quantization. This paper therefore proposes an FP8 quantization simulation strategy for MLIR that covers two formats, FP8E4M3 and FP8E5M2; by simulating quantization of the operators in a network, the impact of each format on model inference accuracy is evaluated. To address memory allocation redundancy in inference engines, a memory reuse strategy based on define-use chains is also proposed, further reducing peak memory usage during model inference. Experiments on the representative Yolov5s and Resnet50 models show that, compared with the existing INT8 quantization strategy, the FP8 strategy maintains better model accuracy and does not rely on a specific calibration algorithm, making deployment simpler. The test cases reach 55.5% and 77.8% accuracy, respectively, and after memory reuse optimization, peak memory usage drops by roughly 15% to 20%.
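
The two FP8 formats named in the abstract trade representable range against precision: FP8E4M3 (1 sign, 4 exponent, 3 mantissa bits) reaches roughly ±448 with finer steps, while FP8E5M2 (1 sign, 5 exponent, 2 mantissa bits) reaches roughly ±57344 at coarser resolution. As a minimal sketch of the quantization-simulation ("fake quantization") idea, the snippet below rounds FP32 tensors through an FP8 format and back; it assumes NumPy and the ml_dtypes package, not the paper's MLIR pass:

    import numpy as np
    import ml_dtypes  # assumed dependency: NumPy-compatible FP8 dtypes

    def fake_quant_fp8(x, fmt="e4m3"):
        # Round FP32 values to the nearest representable FP8 value, then
        # cast back to FP32. This mimics what a quantization-simulation
        # pass does to an operator's weights/activations so accuracy can
        # be measured without FP8 hardware.
        dtype = ml_dtypes.float8_e4m3fn if fmt == "e4m3" else ml_dtypes.float8_e5m2
        return x.astype(dtype).astype(np.float32)

    # Compare the mean rounding error of the two formats on random weights.
    w = np.random.randn(1000).astype(np.float32)
    for fmt in ("e4m3", "e5m2"):
        print(fmt, float(np.abs(fake_quant_fp8(w, fmt) - w).mean()))

On weight-like data concentrated near zero, E4M3 typically shows the lower rounding error, consistent with the abstract's point that the two formats can be chosen per model to balance range and precision.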
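
The define-use-chain memory reuse idea can be sketched in the same spirit: a buffer is live only until the last operator on its tensor's def-use chain executes, after which it returns to a free pool and can be reassigned to later tensors, lowering the peak buffer count. The greedy illustration below, over a hypothetical linear operator list, is an assumption rather than the paper's implementation; a real pass would also match buffer sizes and alignment:

    from dataclasses import dataclass

    @dataclass
    class Op:
        name: str
        defs: list  # tensor names this op produces
        uses: list  # tensor names this op reads

    def assign_buffers(ops):
        # End of each tensor's def-use chain = index of its last use.
        last_use = {}
        for i, op in enumerate(ops):
            for v in op.defs + op.uses:
                last_use[v] = i
        free_pool, buffer_of, total = [], {}, 0
        for i, op in enumerate(ops):
            for v in op.defs:
                if free_pool:
                    buffer_of[v] = free_pool.pop()  # reuse a dead buffer
                else:
                    buffer_of[v] = total            # allocate a new one
                    total += 1
            for v in op.defs + op.uses:
                # Release v's buffer once its def-use chain ends here.
                if v in buffer_of and last_use[v] == i and buffer_of[v] not in free_pool:
                    free_pool.append(buffer_of[v])
        return buffer_of, total

    # Three intermediate tensors, but reuse packs them into two buffers.
    ops = [Op("conv", ["t0"], ["in"]),
           Op("relu", ["t1"], ["t0"]),
           Op("add",  ["t2"], ["t1", "in"])]
    print(assign_buffers(ops))  # ({'t0': 0, 't1': 1, 't2': 0}, 2)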

Keywords: model compression; deep learning compiler; FP8 quantization; MLIR; Yolov5s model

CLC Number: TP311 (Automation and Computer Technology: Computer Software and Theory)