An Optimization Design of GPGPU Microarchitecture for Barrier Synchronization

Authors: JIA Shiwei; ZHANG Yuming [1]; TIAN Ze; QIN Xiang (School of Microelectronics, Xidian University, Xi'an, 710071, CHN; China Institute of Aeronautical Computing Technology, Key Laboratory of Aviation and Technology on Integrated Circuit and Micro-System Design, Xi'an, 710068, CHN; Xiangteng Microelectronics Corporation, Xi'an, 710068, CHN)

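Affiliations: [1] School of Microelectronics, Xidian University, Xi'an 710068; [2] Key Laboratory of Aviation and Technology on Integrated Circuit and Micro-System Design, China Institute of Aeronautical Computing Technology, Xi'an 710068; [3] Xiangteng Microelectronics Corporation, Xi'an 710068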

Source: Research & Progress of SSE (《固体电子学研究与进展》), 2023, No. 1, pp. 70-77 (8 pages)

Funding: Supported by the Equipment Joint Fund (6141B05200305).

Abstract: To reduce the adverse impact of barrier synchronization overhead on program performance in general-purpose graphics processing units (GPGPUs), we propose an optimized GPGPU microarchitecture design. In the warp scheduling module, the scheduling order of the warps is determined by their barrier synchronization overhead, so that warps with high barrier synchronization overhead are scheduled first. In the L1 data cache module, the data cache miss rate and the barrier synchronization state jointly determine whether each memory access request should bypass the cache, which reduces the impact of data cache stall cycles on barrier synchronization without impairing the exploitation of data locality. Both submodule optimizations reduce barrier synchronization overhead. Experimental results show that, compared with the baseline GPGPU architecture and existing barrier synchronization optimization strategies, the design improves instructions per cycle (IPC) by 4.15%, 4.13% and 2.62%, respectively, on barrier-synchronization-intensive applications, demonstrating its effectiveness and practicality.
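The two decisions described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the abstract does not say how barrier synchronization overhead is measured or how the miss rate and barrier state are combined, so the `barrier_overhead` field, the `peers_waiting_at_barrier` flag, and the `miss_rate_threshold` value below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int
    # Hypothetical metric: estimated cycles that other warps spend
    # waiting at a barrier because this warp has not arrived yet.
    barrier_overhead: int
    ready: bool = True


def pick_next_warp(warps):
    """Return the ready warp with the highest estimated barrier overhead.

    Returns None when no warp is ready to issue.
    """
    ready = [w for w in warps if w.ready]
    if not ready:
        return None
    # Ties fall back to the lowest warp id so the choice is deterministic.
    return max(ready, key=lambda w: (w.barrier_overhead, -w.warp_id))


def should_bypass_l1(miss_rate, peers_waiting_at_barrier,
                     miss_rate_threshold=0.8):
    """Decide whether a memory request should bypass the L1 data cache.

    Assumed rule (not taken from the paper): bypass only when the cache
    is missing heavily AND the requesting warp is holding up peers at a
    barrier, so that requests with useful locality still use the cache.
    """
    return peers_waiting_at_barrier and miss_rate >= miss_rate_threshold
```

For example, given ready warps with overheads 10 and 50, `pick_next_warp` selects the warp with overhead 50, and `should_bypass_l1(0.9, True)` requests a bypass while `should_bypass_l1(0.9, False)` does not.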

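Keywords: general-purpose graphics processing unit (GPGPU); barrier synchronization; warp scheduling; L1 data cache; cache bypassing; performance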

Classification: TN4 [Electronics and Telecommunications: Microelectronics and Solid-State Electronics]
