Authors: Xiang LUO, Chen ZHANG, Chenbo GENG, Yanzhi YI, Jiahui HU, Renwei ZHANG, Zhen ZHANG, Gianpietro CONSOLARO, Fan YANG, Tun LU, Ning GU, Li SHANG
Affiliations: [1] School of Computer Science, Fudan University, Shanghai 200433, China [2] School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China [3] School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China [4] Huawei Technologies Co., Ltd., Beijing 100095, China [5] Huawei Paris Research Center, Paris 92100, France [6] School of Microelectronics, Fudan University, Shanghai 201203, China
Source: Science China (Information Sciences), 2024, No. 10, pp. 63-80 (18 pages). Chinese title: 中国科学(信息科学)(英文版)
Funding: Supported in part by the National Natural Science Foundation of China (Grant Nos. 62090025, 92373207) and the National Key Research and Development Program of China (Grant Nos. 2023YFB4405101, 2022YFB4400400, 2023YFB4405103, 2023YFB4405104).
Abstract: Today's deep learning models face an increasing demand to handle dynamic-shape tensors and computation, whose shape information remains unknown at compile time and varies over a nearly infinite range at runtime. This shape dynamism poses tremendous challenges for existing compilation pipelines, which are designed for static models and optimize tensor programs based on exact shape values. This paper presents TSCompiler, an end-to-end compilation framework for dynamic-shape models. TSCompiler first proposes a symbolic shape propagation algorithm that recovers symbolic shape information at compile time to enable subsequent optimizations. TSCompiler then partitions the shape-annotated computation graph into multiple subgraphs and fine-tunes the backbone operators of each subgraph within a hardware-aligned search space to find a collection of high-performance schedules. Based on dependence analysis, TSCompiler can propagate the explored backbone schedule to other fusion groups within the same subgraph, generating a set of parameterized tensor programs for the fused cases. At runtime, TSCompiler uses an occupancy-targeted cost model to select among the pre-compiled tensor programs for varied tensor shapes. Extensive evaluations show that TSCompiler can achieve state-of-the-art speedups for dynamic-shape models. For example, it improves kernel efficiency by up to 3.97× on NVIDIA RTX 3090 and 10.30× on NVIDIA A100, and achieves up to five orders of magnitude speedups on end-to-end latency.
Keywords: machine learning; tensor compilers; dynamic shape; operator fusion; code generation; auto-tuning
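The abstract mentions symbolic shape propagation without giving details. As a rough illustration of the general idea only, and not TSCompiler's actual algorithm, the Python sketch below carries named dimensions through a tiny operator graph. Every name here (`Dim`, `unify`, `matmul_shape`, `add_shape`, `propagate`, the graph encoding) is hypothetical and chosen for this example.

```python
from typing import Dict, List, Sequence, Tuple, Union

# Hypothetical encoding: an int is a dimension known at compile time,
# a str is a symbolic dimension whose value is only known at runtime.
Dim = Union[int, str]
Shape = Tuple[Dim, ...]


def unify(a: Dim, b: Dim) -> Dim:
    """Merge two dimensions that an operator requires to be equal."""
    if a == b:
        return a
    if isinstance(a, int) and isinstance(b, int):
        raise ValueError(f"shape mismatch: {a} vs {b}")
    # At least one side is symbolic. Keep the concrete value if there is one;
    # two different symbols are simply assumed equal in this sketch.
    return a if isinstance(a, int) else b


def matmul_shape(lhs: Shape, rhs: Shape) -> Shape:
    """(m, k) x (k, n) -> (m, n); the inner dimensions must agree."""
    (m, k1), (k2, n) = lhs, rhs
    unify(k1, k2)  # raises if both are concrete and different
    return (m, n)


def add_shape(a: Shape, b: Shape) -> Shape:
    """Elementwise add without broadcasting: dimensions unify pairwise."""
    if len(a) != len(b):
        raise ValueError("rank mismatch")
    return tuple(unify(x, y) for x, y in zip(a, b))


RULES = {"matmul": matmul_shape, "add": add_shape}

# A node is (output name, operator name, input names), in topological order.
Node = Tuple[str, str, Sequence[str]]


def propagate(graph: List[Node], inputs: Dict[str, Shape]) -> Dict[str, Shape]:
    """Annotate every tensor in the graph with a (possibly symbolic) shape."""
    shapes = dict(inputs)
    for out, op, args in graph:
        shapes[out] = RULES[op](*(shapes[name] for name in args))
    return shapes


if __name__ == "__main__":
    graph = [("h", "matmul", ("x", "w")), ("y", "add", ("h", "r"))]
    inputs = {"x": ("batch", 768), "w": (768, 3072), "r": ("batch", 3072)}
    print(propagate(graph, inputs))
```

In this toy setting the symbol `"batch"` survives through both operators, so later compilation stages could reason about the output shape `("batch", 3072)` without knowing the batch size. A real compiler would also need symbolic arithmetic (for example for reshapes and convolutions) and constraints between symbols, which this sketch omits.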