Fast Convolution Automatic Performance Optimization Based on Tensor Virtual Machine (Cited by: 1)


Authors: CHEN Jiang, ZHU Honglin, MENG Jintao [2], WEI Yanjie [2] (Southern University of Science and Technology, Shenzhen 518055, China; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; Shenzhen Tencent Computer System Co. Ltd., Shenzhen 518063, China)

Affiliations: [1] Southern University of Science and Technology, Shenzhen 518055; [2] Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055; [3] Shenzhen Tencent Computer System Co. Ltd., Shenzhen 518063

Published in: Journal of Integration Technology (《集成技术》), 2024, No. 5, pp. 3-18 (16 pages)

Funding: Guangdong Provincial Key-Area R&D Program (2021B0101310002); National Natural Science Foundation of China (62272449); Shenzhen Basic Research Program (RCYX20200714114734194, KQTD20200820113106007, ZDSYS20220422103800001); Youth Innovation Promotion Association of the Chinese Academy of Sciences (Y2021101).

Abstract: Convolutional neural networks (CNNs), a quintessential form of deep learning, are the most widely used neural networks in tasks such as computer vision. However, convolution operations typically account for over 90% of a CNN's runtime, making them the performance bottleneck. Moreover, given the complexity of modern hardware and the diversity of workloads, the specialized optimizations of previous work often lack performance portability. To address this, the authors present BlazerML, an open-source convolution library built on template-based automatic code generation in the Tensor Virtual Machine (TVM), which automatically generates high-performance convolution implementations for any input shape. BlazerML is implemented on top of the Winograd algorithm, the highest-performing of the fast convolution algorithms. Experimental results show that BlazerML significantly outperforms current state-of-the-art open-source libraries. On x86 CPUs, forward inference of common deep learning networks is 1.18–2.47×, 1.18–2.27×, and 1.01–1.66× faster than OnnxRuntime, MNN, and the TVM community version, respectively. On ARM CPUs, single-layer inference of common deep learning networks is 1.26–6.11× and 1.04–4.28× faster than ACL and FastConv, respectively.
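The Winograd algorithm named in the abstract reduces the number of multiplications in small convolutions at the cost of extra additions. As a minimal illustrative sketch (not the paper's implementation, which generates tiled 2-D variants via TVM templates), the 1-D base case F(2,3) computes two outputs of a 3-tap convolution with four multiplications instead of the six a direct method needs:

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap 1-D convolution
    over four inputs, using 4 multiplications (direct needs 6)."""
    # The four Winograd products
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    # Outputs are recovered with additions/subtractions only
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv(d, g):
    """Reference: direct sliding-window 1-D convolution."""
    return [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]

print(winograd_f23([1, 2, 3, 4], [1, 0, -1]))  # matches direct_conv
```

The filter-side terms (g[0]+g[1]+g[2])/2 etc. are fixed per filter, so in a real CNN layer they are precomputed once and amortized over every tile, which is why Winograd pays off despite the extra additions.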

Keywords: deep learning; convolutional neural networks; fast convolution algorithms; Winograd algorithm; TVM; automatic performance optimization

Classification: TP183 [Automation and Computer Technology — Control Theory and Control Engineering]
