GPGPU cache bypassing system for 2D and 3D convolution (cited by: 1)

Authors: JIA Shiwei; ZHANG Yuming [1]; QIN Xiang; SUN Chenglu; TIAN Ze (School of Microelectronics, Xidian University, Xi’an 710071, China; Department of Integrated Circuit R&D, Xiangteng Microelectronics Corporation, Xi’an 710068, China; Key Laboratory of Aviation and Technology on Integrated Circuit and Micro-System Design, China Institute of Aeronautical Computing Technology, Xi’an 710068, China)

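Affiliations: [1] School of Microelectronics, Xidian University, Xi’an, Shaanxi 710071; [2] Xi’an Xiangteng Microelectronics Technology Co., Ltd., Xi’an, Shaanxi 710068; [3] Key Laboratory of Aviation and Technology on Integrated Circuit and Micro-System Design, China Institute of Aeronautical Computing Technology, Xi’an, Shaanxi 710068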

Source: Journal of Xidian University, 2023, No. 2, pp. 92-100 (9 pages)

Funding: Equipment Joint Fund (6141B05200305).

Abstract: As the core acceleration platform for convolutional neural networks, the general-purpose graphics processing unit (GPGPU) determines, through its performance on two-dimensional (2D) and three-dimensional (3D) convolution, how effectively neural networks can be applied to real-time target recognition and detection. However, limited by the functionality of its inherent cache system, the current GPGPU architecture cannot accelerate 2D and 3D convolution efficiently. To address this problem, a dynamic L1D cache bypassing design is proposed. First, the design defines a set of data structures that dynamically reflect the cache access characteristics of an instruction and, based on them, a memory-access-feature record table that records the execution status of different memory access instructions as they request the cache. Second, a warp scheduling strategy that prioritizes a thread block is adopted to speed up the sampling of the memory access state. Finally, the sampled state is used to decide, for each PC value, whether its memory requests should bypass the L1D cache, and requests for some low-locality data are dynamically bypassed. As a result, the L1D cache space is reserved for high-locality data and the memory access stall cycles of 2D and 3D convolution are reduced, which improves the memory access efficiency of 2D and 3D convolution on the GPGPU. Experimental results show that, compared with the original architecture, the design brings performance improvements of about 2.16% for 2D convolution and 19.79% for 3D convolution, demonstrating its effectiveness and practicality.
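The per-PC sampling and bypass decision described in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the authors' hardware design: the abstract does not give the record-table fields, the sampling window, or the decision rule, so the names `PCRecord`, `BypassTable`, `TinyL1D`, `sample_window`, and `min_hit_rate`, the hit-rate threshold, and the fully associative LRU cache are all hypothetical. The warp scheduling strategy is not modeled.

```python
from collections import OrderedDict
from dataclasses import dataclass

@dataclass
class PCRecord:
    """Sampling counters for one memory instruction (hypothetical fields)."""
    accesses: int = 0
    hits: int = 0
    decided: bool = False
    bypass: bool = False

class BypassTable:
    """Toy per-PC record table; window and threshold are assumed values."""
    def __init__(self, sample_window=8, min_hit_rate=0.25):
        self.sample_window = sample_window
        self.min_hit_rate = min_hit_rate
        self.records = {}

    def should_bypass(self, pc):
        rec = self.records.get(pc)
        return rec is not None and rec.decided and rec.bypass

    def record_access(self, pc, hit):
        rec = self.records.setdefault(pc, PCRecord())
        if rec.decided:
            return
        rec.accesses += 1
        rec.hits += int(hit)
        if rec.accesses >= self.sample_window:
            # Assumed rule: a low sampled hit rate marks the PC as low-locality.
            rec.bypass = rec.hits / rec.accesses < self.min_hit_rate
            rec.decided = True

class TinyL1D:
    """Fully associative LRU cache model, tracking line addresses only."""
    def __init__(self, lines=4):
        self.lines = lines
        self.data = OrderedDict()

    def access(self, addr):
        if addr in self.data:
            self.data.move_to_end(addr)    # refresh LRU position on a hit
            return True
        if len(self.data) >= self.lines:
            self.data.popitem(last=False)  # evict the least recently used line
        self.data[addr] = True
        return False

def run(trace, table=None, lines=4):
    """Replay (pc, addr) pairs and return the number of L1D hits."""
    cache, hits = TinyL1D(lines), 0
    for pc, addr in trace:
        if table is not None and table.should_bypass(pc):
            continue                       # bypassed request never touches L1D
        hit = cache.access(addr)
        hits += int(hit)
        if table is not None:
            table.record_access(pc, hit)
    return hits

PC_REUSE, PC_STREAM = 0x20, 0x10           # hypothetical instruction addresses

def build_trace(iterations=10):
    """Each iteration reuses two lines, then streams four never-reused lines."""
    trace = []
    for i in range(iterations):
        trace += [(PC_REUSE, 0), (PC_REUSE, 1), (PC_REUSE, 0), (PC_REUSE, 1)]
        trace += [(PC_STREAM, 1000 + 4 * i + j) for j in range(4)]
    return trace

trace = build_trace()
table = BypassTable()
baseline_hits = run(trace)
bypass_hits = run(trace, table)
print(baseline_hits, bypass_hits)
```

In this synthetic trace the streaming instruction evicts the reused lines on every iteration until its sampled hit rate marks it for bypass, after which the reused lines stay resident. The trace is constructed to show the effect and says nothing about the speedups reported in the abstract.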

Keywords: convolution; general-purpose graphics processing unit; memory system; cache bypassing

Classification: TN4 [Electronics and Telecommunications: Microelectronics and Solid-State Electronics]
