TSCompiler: efficient compilation framework for dynamic-shape models


Authors: Xiang LUO, Chen ZHANG, Chenbo GENG, Yanzhi YI, Jiahui HU, Renwei ZHANG, Zhen ZHANG, Gianpietro CONSOLARO, Fan YANG, Tun LU, Ning GU, Li SHANG

Affiliations: [1] School of Computer Science, Fudan University, Shanghai 200433, China; [2] School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; [3] School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China; [4] Huawei Technologies Co., Ltd., Beijing 100095, China; [5] Huawei Paris Research Center, Paris 92100, France; [6] School of Microelectronics, Fudan University, Shanghai 201203, China

Source: Science China (Information Sciences), 2024, Issue 10, pp. 63-80 (18 pages)

Funding: supported in part by the National Natural Science Foundation of China (Grant Nos. 62090025, 92373207) and the National Key Research and Development Program of China (Grant Nos. 2023YFB4405101, 2022YFB4400400, 2023YFB4405103, 2023YFB4405104).

Abstract: Today's deep learning models face an increasing demand to handle dynamic-shape tensors and computation, whose shape information remains unknown at compile time and varies over a nearly unbounded range at runtime. This shape dynamism poses serious challenges for existing compilation pipelines, which are designed for static models and optimize tensor programs by relying on exact shape values. This paper presents TSCompiler, an end-to-end compilation framework for dynamic-shape models. TSCompiler first proposes a symbolic shape propagation algorithm that recovers symbolic shape information at compile time to enable subsequent optimizations. It then partitions the shape-annotated computation graph into multiple subgraphs and fine-tunes the backbone operators of each subgraph within a hardware-aligned search space to find a collection of high-performance schedules. Based on dependence analysis, TSCompiler propagates the explored backbone schedules to other fusion groups within the same subgraph to generate a set of parameterized tensor programs for fused cases. At runtime, TSCompiler uses an occupancy-targeted cost model to select among the pre-compiled tensor programs for varied tensor shapes. Extensive evaluations show that TSCompiler achieves state-of-the-art speedups for dynamic-shape models: it improves kernel efficiency by up to 3.97× on an NVIDIA RTX 3090 and 10.30× on an NVIDIA A100, and achieves up to five orders of magnitude speedup in end-to-end latency.
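The symbolic shape propagation the abstract describes can be illustrated with a minimal sketch. This is not TSCompiler's actual algorithm or API (all names here are hypothetical); it only shows the core idea that dimensions unknown at compile time are carried as named symbols through each operator's shape rule, so downstream passes still see structured shape information:

```python
# Illustrative sketch of compile-time symbolic shape propagation.
# Hypothetical helper names; a dimension is a concrete int or a symbol name.
from typing import Tuple, Union

Dim = Union[int, str]
Shape = Tuple[Dim, ...]

def matmul_shape(a: Shape, b: Shape) -> Shape:
    # (m, k) @ (k2, n) -> (m, n). If either inner dim is symbolic, a real
    # compiler would emit a runtime equality check instead of failing here.
    (m, k), (k2, n) = a, b
    if isinstance(k, int) and isinstance(k2, int) and k != k2:
        raise ValueError(f"inner dims mismatch: {k} vs {k2}")
    return (m, n)

def relu_shape(a: Shape) -> Shape:
    # Elementwise ops preserve shape, symbolic dims included.
    return a

# Toy graph: x:(N, 768) -> matmul W:(768, 3072) -> relu -> matmul V:(3072, 768)
x: Shape = ("N", 768)   # batch dim N is unknown until runtime
h = relu_shape(matmul_shape(x, (768, 3072)))
y = matmul_shape(h, (3072, 768))
print(y)  # ('N', 768): the symbolic batch dim flows through the whole graph
```

Keeping `N` symbolic rather than erasing it to "unknown" is what lets later passes (fusion grouping, schedule parameterization) reason about which dimensions are shared across operators.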
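The runtime step, selecting among pre-compiled tensor programs with an occupancy-targeted cost model, can likewise be sketched. The model and kernel names below are invented stand-ins, not TSCompiler's real cost model; the sketch only shows the dispatch pattern: once the concrete shape is known, score each candidate cheaply and run the best one:

```python
# Illustrative sketch of runtime program selection (hypothetical names).
def occupancy_estimate(tile: int, n: int) -> float:
    # Toy stand-in for an occupancy-targeted cost model: the fraction of a
    # length-n dimension covered by complete tiles of width `tile`.
    full = (n // tile) * tile
    return full / n

def select_tile(programs: dict, n: int) -> int:
    # `programs` maps a tile width to a pre-compiled kernel; pick the tile
    # the cost model scores highest for this concrete runtime shape.
    return max(programs, key=lambda tile: occupancy_estimate(tile, n))

# Three pre-compiled variants of one fused kernel, specialized by tile width.
programs = {32: "kernel_t32", 64: "kernel_t64", 128: "kernel_t128"}
print(select_tile(programs, 96))  # 32: only 32-wide tiles cover 96 exactly
```

Because selection happens per invocation on already-compiled programs, the dispatch cost stays small even though the shape can differ on every call.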

Keywords: machine learning; tensor compilers; dynamic shape; operator fusion; code generation; auto-tuning

Classification: TP18 (Automation and Computer Technology: Control Theory and Control Engineering); TP314 (Automation and Computer Technology: Control Science and Engineering)

 
