Authors: LIU Xiao-hang (刘晓航); JIANG Jing-fei (姜晶菲); XU Jin-wei (许金伟)
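Affiliations: [1] Graduate College, National University of Defense Technology, Changsha 410073, Hunan, China; [2] Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, Hunan, China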
Source: Computer Engineering & Science (《计算机工程与科学》), 2023, No. 5, pp. 802-809 (8 pages)
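Funding: Key Project of Stable Support for National Defense Science and Technology Key Laboratories, State Administration of Science, Technology and Industry for National Defense (WDZC20215250103).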
Abstract: Attention mechanisms have recently shown superior performance in deep neural networks, but their computation involves complex data flows and incurs high memory and compute overheads, so customized accelerators are needed to optimize inference. This paper proposes an accelerator architecture for attention computation. A flexible, hardware-controlled partitioning method divides the large matrices in the model into hardware-friendly blocks, so that the block-matrix computation matches the accelerator's systolic array. A layer-fusion computing method based on a two-step decomposition of the softmax function is proposed, which effectively reduces the memory accesses of attention computation. A layer-fused attention accelerator with fine-grained computation scheduling is designed and implemented in a hardware description language (HDL). Performance was evaluated on a Xilinx FPGA device with HLS tools. Under the same settings, the accelerator improves latency by 4.91 times compared with the CPU and improves energy efficiency by 1.24 times compared with the GPU.
Keywords: systolic array; attention mechanism; layer fusion; accelerator architecture; matrix partitioning; softmax function
Classification: TP391 [Automation and Computer Technology - Computer Application Technology]
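
The abstract mentions a two-step decomposition of softmax that lets blocked computation avoid extra memory traffic, but the record gives no details of the paper's scheme. The sketch below is only a generic software illustration of how a softmax over one row can be split into a per-block pass and a combine pass; it is not the authors' hardware design. The function names and the `block_size` parameter are hypothetical, chosen for this example.

```python
import math


def softmax_reference(row):
    """Plain numerically stable softmax over a whole row, used for comparison."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def softmax_two_step(row, block_size):
    """Softmax over `row`, processed in blocks of `block_size` elements.

    Illustrative assumption: each block is handled independently first,
    then the per-block results are combined with a global rescale.
    """
    # Step 1: per-block partial results (local max, local sum, local exponentials).
    partials = []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        local_max = max(block)
        local_exps = [math.exp(x - local_max) for x in block]
        partials.append((local_max, sum(local_exps), local_exps))

    # Step 2: derive the global max and global sum from the partials,
    # then rescale every block so the whole row sums to 1.
    global_max = max(p[0] for p in partials)
    global_sum = sum(s * math.exp(m - global_max) for m, s, _ in partials)
    out = []
    for local_max, _, local_exps in partials:
        scale = math.exp(local_max - global_max) / global_sum
        out.extend(e * scale for e in local_exps)
    return out
```

Calling `softmax_two_step(scores, block_size)` on a list of attention scores should agree with `softmax_reference(scores)` up to floating-point rounding. Step 1 needs only one block of the row at a time, which is the property that makes this kind of split attractive when a row is computed block by block.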