Authors: Weiwei WU, Fengbin TU, Xiangyu LI, Shaojun WEI, Shouyi YIN
Affiliations: [1] School of Integrated Circuits, Tsinghua University, Beijing 100084, China; [2] Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China
Source: Science China (Information Sciences) [中国科学(信息科学)(英文版)], 2024, No. 2, pp. 298-317 (20 pages)
Funding: Supported in part by the National Natural Science Foundation of China (Grant Nos. U19B2041, 62125403, 92164301); the National Key Research and Development Program (Grant No. 2021ZD0114400); the Science and Technology Innovation 2030 "New Generation of AI" Project (Grant No. 2022ZD0115201); the Beijing National Research Center for Information Science and Technology; and the Beijing Advanced Innovation Center for Integrated Circuits.
Abstract: On-device training for deep neural networks (DNNs) has become a trend due to diverse user preferences and scenarios. The DNN training process consists of three phases: feedforward (FF), backpropagation (BP), and weight gradient (WG) update. WG accounts for about one third of the computation in the whole training process. Current training accelerators usually ignore the special computation property of WG and process it in a way similar to FF/BP. Besides, the extensive data sparsity in WG, which brings opportunities to save computation, is not well explored. Exploiting these optimization opportunities, however, faces three underutilization problems, caused by (1) the mismatch between WG data dimensions and hardware parallelism, (2) the full sparsity, i.e., the sparsity of the feature map (Fmap), error map (Emap), and gradient, and (3) the workload imbalance resulting from irregular sparsity. In this paper, we propose a specific architecture for sparse weight gradient (SWG) computation. The architecture is designed based on a hierarchical unrolling and sparsity-aware (HUSA) dataflow to exploit the optimization opportunities of the special computation property and the full data sparsity. In the HUSA dataflow, the data dimensions are unrolled hierarchically on the hardware architecture. A valid-data trace (VDT) mechanism is embedded in the dataflow to avoid the underutilization caused by the two-sided input sparsity. The gradient is unrolled in the PE to alleviate the underutilization induced by output sparsity while maintaining data reuse opportunities. Besides, we design an intra- and inter-column balancer (IIBLC) to dynamically tackle the workload imbalance problem resulting from the irregular sparsity. Experimental results show that, with the HUSA dataflow exploiting the full sparsity, SWG achieves a speedup of 12.23x over the state-of-the-art gradient computation architecture TrainWare. SWG helps to improve the energy efficiency of the state-of-the-art training accelerator LNPU from 7.56 to 10.58 TOPS/W.
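Note: For a convolution layer, the weight gradient the abstract refers to is a correlation between the input feature map (Fmap) and the back-propagated error map (Emap), so a zero in either operand makes the corresponding multiplication skippable. The following is a minimal NumPy sketch of that idea for a stride-1 convolution on a single image; the function name, shapes, and the explicit non-zero scan are illustrative assumptions, not the paper's HUSA dataflow, VDT mechanism, or hardware design.

    import numpy as np

    def weight_gradient_two_sided_sparse(fmap, emap, R, S):
        # Illustrative sketch (assumed shapes, single image, stride 1):
        #   fmap: (C, H + R - 1, W + S - 1)  input activations (Fmap)
        #   emap: (K, H, W)                  back-propagated errors (Emap)
        # Returns grad of shape (K, C, R, S), i.e., dL/dW.
        C = fmap.shape[0]
        K, H, W = emap.shape
        grad = np.zeros((K, C, R, S))
        # Visit only non-zero error elements: zero Emap entries contribute nothing.
        for k, h, w in zip(*np.nonzero(emap)):
            e = emap[k, h, w]
            patch = fmap[:, h:h + R, w:w + S]        # Fmap window aligned with this error element
            c_idx, r_idx, s_idx = np.nonzero(patch)  # skip zero activations as well
            grad[k, c_idx, r_idx, s_idx] += e * patch[c_idx, r_idx, s_idx]
        return grad

In this sketch only non-zero Emap entries are visited and, inside each aligned Fmap window, only non-zero activations are multiplied and accumulated; this is the two-sided input-sparsity saving that the paper's VDT mechanism realizes in hardware. The gradient output is kept dense here, whereas SWG additionally exploits output (gradient) sparsity and balances the resulting irregular workload with the IIBLC.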
Keywords: CNN; training; gradient computation; sparsity; architecture
Classification: TP183 [Automation and Computer Technology - Control Theory and Control Engineering]