GEMM Algorithm Design and Optimization Implementation Technology Based on Feiteng D2000


Authors: ZHENG En; BAI Lin-ting; WEN Peng-cheng [1,2] (Xi'an Aeronautics Computing Technique Research Institute, AVIC, Xi'an 710000, China; Key Laboratory of Airborne and Missile-borne Computer Aeronautical Science and Technology, Xi'an 710000, China)

Affiliations: [1] Xi'an Aeronautics Computing Technique Research Institute, AVIC, Xi'an 710000, Shaanxi, China; [2] Key Laboratory of Airborne and Missile-borne Computer Aeronautical Science and Technology, Xi'an 710000, Shaanxi, China

Source: Aeronautical Computing Technique (《航空计算技术》), 2024, No. 3, pp. 38-41, 47 (5 pages)

Funding: Aeronautical Science Foundation of China (2022Z071031001).

Abstract: In deep learning inference frameworks, GEMM is a typical compute-intensive operator. The modules of models such as BERT, Transformer, and YOLO contain large numbers of GEMM operations, so the quality of the operator's underlying implementation directly affects model inference latency. Because edge embedded platforms have limited computing power, optimizing this operator is crucial. This paper optimizes GEMM for embedded deployment using loop unrolling, OpenMP, and the NEON instruction set, with experiments conducted on the domestic Feiteng D2000 embedded board running a domestic operating system. The results show that the optimized operator achieves a 43.89x speedup over the unoptimized baseline, demonstrating that the optimization methods are effective and can substantially reduce the inference latency of artificial intelligence models at the edge.

Keywords: inference framework; GEMM; OpenMP; NEON; Feiteng D2000

Classification Number: V247 [Aerospace Science and Technology: Aircraft Design]

 
