Exascale Supercomputer Oriented Parallel Cyclic Compression Based Checking Structure for Floating-Point Fused Multiply-Add Unit


Authors: GAO Jian-Gang (高剑刚); LIU Xiao (刘骁); ZHENG Fang (郑方); TANG Yong (唐勇) (National Research Center of Parallel Computer Engineering and Technology, Beijing 100190)

Affiliation: [1] National Research Center of Parallel Computer Engineering and Technology, Beijing 100190

Source: Chinese Journal of Computers (《计算机学报》), 2023, No. 6, pp. 1103-1120 (18 pages)

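Abstract: Exascale supercomputers face the severe challenge of more than one billion floating-point fused multiply-add (FMA) units running simultaneously, and a small change in the error detection rate of a single FMA unit can cause a large change in system availability. The high operating frequency of exascale supercomputer cores and the need for real-time checking place stricter demands on the timing of the checking logic. At the same time, exascale supercomputers must control system scale by integrating more cores in the same chip area, so on-chip resources are tight. FMA checker design must therefore control the timing and area overhead of the checking logic while preserving error detection capability. This paper proposes a parallel cyclic 4:2 compression structure. As the modulus of the residue system increases, this structure reduces the timing and area overhead of the residue generation logic while improving the error detection capability of the residue system. This paper also studies FMA mantissa operations in the residue domain and proposes accelerated residue-domain transformations for the negated sign-extension operation, the multiplication mantissa, and the addition mantissa. Experimental results show that, compared with modular-adder-tree residue generation logic and CSA (Carry-Save Adder) 3:2 compression residue generation logic, the proposed parallel cyclic 4:2 hybrid compression residue generation logic achieves timing improvements of up to 19.64% and 6.75% and area improvements of up to 71% and 18.18%, respectively. The modulo-63 residue check based on the parallel cyclic 4:2 compression tree outperforms the modulo-15 floating-point FMA checking design adopted by IBM in area overhead, error detection rate, and system availability: the improvements in area overhead and error detection rate reach 67.61% and 5%, respectively, and the improvement in system availability reaches up to 49.6%.

The simultaneous operation of billions of floating-point FMA (Fused Multiply-Add) units raises severe availability challenges for exascale supercomputers. To ensure sustainable and efficient operation, processors must adopt more efficient fault-tolerance mechanisms for FMA units. In an exascale supercomputer, real-time checking on high-frequency processors and limited on-chip resources challenge the design of the FMA checker, which must account for timing and hardware overhead while achieving better error detection coverage. Floating-point FMA adopts a fused design and must handle multiple special operations in the IEEE 754 standard, such as mantissa alignment shift, normalization, and rounding; as a result, the widely used residue domain transformation cannot effectively accelerate residue encoding in FMA units. In this paper, we propose a parallel cyclic 4:2 compressor-based residue generation technique, which reduces the number of logic gates on the critical path as the modulus increases. By adopting cyclic carry processing for the highest bit of each partition, cyclic 4:2 compressors weaken the logical dependency in carry chains and reduce the overhead caused by carry correction. While improving error detection coverage, the cyclic 4:2 compressor can reduce the timing cost and hardware overhead of residue generation. We also study mantissa calculation in the residue domain and propose residue domain compression techniques for negative sign extension of the mantissa, mantissa multiplication, and mantissa addition based on mathematical transformations. These techniques reduce the input data width of the residue generator and limit the alignment range by dividing and transforming the mantissa fusion operation. For the reverse sign extension of the mantissa in the residue domain, this paper decreases the overhead by transforming the negative sign extension operation into a combination of residue generation and modular subtraction. For the mantissa m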
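
As a rough software illustration of the residue-checking idea described in the abstract, the following Python sketch models residue generation modulo 2^k - 1 (k = 6 gives modulo 63, k = 4 gives modulo 15) using carry-save compression with an end-around ("cyclic") carry. It is a behavioral sketch under stated assumptions, not the paper's circuit: the function names (`csa_eac`, `compress_4to2`, `residue_tree`, `check_fma`) are hypothetical, the 4:2 compressor is modeled as two cascaded 3:2 steps, and the FMA is treated as an exact integer `a * b + c`, ignoring the alignment, normalization, and rounding steps that a real IEEE 754 FMA checker must handle.

```python
def csa_eac(a: int, b: int, c: int, k: int) -> tuple[int, int]:
    """3:2 carry-save step modulo 2**k - 1 on k-bit operands.

    The carry word would normally be shifted left by one; because
    2**k is congruent to 1 (mod 2**k - 1), the bit shifted out of the
    top position wraps around to bit 0 (end-around carry).
    """
    m = (1 << k) - 1
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    carry = ((carry << 1) & m) | (carry >> (k - 1))  # rotate left by 1
    return s, carry


def compress_4to2(a: int, b: int, c: int, d: int, k: int) -> tuple[int, int]:
    """Behavioral 4:2 compressor: two cascaded end-around-carry 3:2 steps."""
    s1, c1 = csa_eac(a, b, c, k)
    return csa_eac(s1, c1, d, k)


def split_chunks(x: int, k: int) -> list[int]:
    """Split a non-negative integer into k-bit chunks (least significant first)."""
    m = (1 << k) - 1
    chunks = []
    while x:
        chunks.append(x & m)
        x >>= k
    return chunks or [0]


def residue_tree(x: int, k: int) -> int:
    """Residue of x modulo 2**k - 1 via 4:2 compression of its k-bit chunks."""
    m = (1 << k) - 1
    ops = split_chunks(x, k)
    while len(ops) > 2:
        group, ops = ops[:4], ops[4:]
        group += [0] * (4 - len(group))      # pad an incomplete group with zeros
        ops.extend(compress_4to2(*group, k))
    total = sum(ops)                          # final two-operand addition
    total = (total & m) + (total >> k)        # fold the carry-out back in
    return 0 if total == m else total         # all-ones also represents zero


def check_fma(a: int, b: int, c: int, result: int, k: int = 6) -> bool:
    """Return True if result is consistent with a*b + c modulo 2**k - 1."""
    m = (1 << k) - 1
    expected = (residue_tree(a, k) * residue_tree(b, k) + residue_tree(c, k)) % m
    return residue_tree(result, k) == expected


if __name__ == "__main__":
    a, b, c = (1 << 53) - 1, (1 << 53) - 3, 12345
    good = a * b + c
    bad = good ^ (1 << 17)                    # inject a single-bit error
    print(check_fma(a, b, c, good), check_fma(a, b, c, bad))
```

A residue check of this kind misses only errors whose numerical difference from the correct result is a multiple of the modulus, which is why a larger modulus such as 63 can detect errors that a modulo-15 check would miss.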

Keywords: floating-point fused multiply-add; availability; floating-point checking; modular adder; parallel cyclic compression

Classification: TP302 [Automation and Computer Technology: Computer Architecture]

 
