检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:贾世伟 张玉明[1] 秦翔 孙成璐 田泽 JIA Shiwei;ZHANG Yuming;QIN Xiang;SUN Chenglu;TIAN Ze(School of Microelectronics,Xidian University,Xi’an 710071,China;Department of Integrated Circuit R&D,Xiangteng microelectronics corporation,Xi’an 710068,China;Key Laboratory of Aviation and Technology on Integrated Circuit and Micro-System Design,China Institute of Aeronautical Computing Technology,Xi’an 710068,China)
机构地区:[1]西安电子科技大学微电子学院,陕西西安710071 [2]西安翔腾微电子科技有限公司,陕西西安710068 [3]中国航空计算技术研究所集成电路与微系统设计航空科技重点实验室,陕西西安710068
出 处:《西安电子科技大学学报》2023年第2期92-100,共9页Journal of Xidian University
基 金:装备联合基金(6141B05200305)。
摘 要:通用图形处理器作为卷积神经网络的核心加速平台,其处理二维、三维卷积的性能,决定着神经网络在实时目标识别检测领域的有效应用。然而,受其固有cache系统功能的限制,当前通用图形处理器架构无法实现二维、三维卷积的高效加速。针对此问题,首先提出一种L1Dcache动态旁路设计方案。该方案定义了一组能够动态反映指令访问cache特征的数据结构,并基于此数据结构定义访存特征记录表,以记录不同访存指令在请求cache时的执行状态。其次,采用优先线程块的warp调度策略来加速访存状态的采样。最后根据访存状态得出不同PC值下访存请求对L1Dcache的旁路的判定,并动态完成部分低局域性数据请求对L1Dcache的旁路。由此将L1Dcache空间保留给高局域性的数据并降低二维、三维卷积执行时的访存阻塞周期,进而提升了二维、三维卷积在通用图形处理器上执行时的访存效率。实验结果表明,相比原架构,在面向二维、三维卷积时分别带来了约2.16%与19.79%的性能提升,体现了设计方案的有效性与实用性。As the core computing platform of the convolution neural network,general-purpose graphics processor(GPGPU),its performance of processing two-dimensional and three-dimensional convolution determines the application of the neural network in real-time target recognition and detection.However,limited by inherent cache system design,the current GPGPU architecture cannot achieve efficient acceleration of 2D and 3D convolution computing.Aiming at this problem,a dynamic L1Dcache bypassing design for this problem is proposed.First,we define a new data structure that can dynamically reflect the cache access characteristics of an instruction,and then defines a memory-access-feature record table based on this information,in order to record the execution status of different memory accesses.Second,the warp scheduling strategy with the priority thread block is adopted,which can speed up the sampling of the memory access state.Next,the L1Dcache bypassing decision of memory accesses under different PCs is obtained due to the sampling results.Finally,the L1Dcache bypassing of some low-locality data accesses is completed.As a result,the L1Dcache space is reserved for data with high locality and the memory access stall cycle of 2D and 3D convolution is reduced.In addition,the memory access efficiency of 2D and 3D convolution has been improved.Compared with the original design,experimental results show that the L1Dcache bypassing design brings 2.16%performance improvements in 2D convolution and 19.79%in 3D convolution.Experiments prove the effectiveness and practicality of this design.
分 类 号:TN4[电子电信—微电子学与固体电子学]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:18.223.109.25