Fault tolerance on-chip: a reliable computing paradigm using self-test, self-diagnosis, and self-repair (3S) approach

Authors: Xiaowei LI, Guihai YAN, Jing YE, Ying WANG

Affiliations: [1] State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; [2] University of Chinese Academy of Sciences, Beijing 100049, China

Source: Science China (Information Sciences) [中国科学(信息科学)(英文版), English edition], 2018, Issue 11, pp. 27-43 (17 pages)

Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 61532017, 61572470, 61521092, 61522406, 61432017, 61376043) and in part by the Youth Innovation Promotion Association, CAS (Grant No. Y404441000).

Abstract: If your computer crashes, you can revive it by a reboot, an empirical solution that usually turns out to be effective. The rationale behind this solution is that transient faults, whether in hardware or software, can be fixed by refreshing the machine state. Such a "silver bullet", however, could be futile in the future because faults, especially those in hardware such as integrated circuit (IC) chips, cannot be eliminated by refreshing. What we need is a more sophisticated mechanism to steer the system back on track. The "magic cure" is the Fault Tolerance On-Chip (FTOC) mechanism, which relies on a suite of built-in design-for-reliability logic, including fault detection, fault diagnosis, and error recovery, working in a self-supportive manner. We have exploited FTOC to build a holistic solution, ranging from on-chip fault detection to error recovery mechanisms, that addresses faults caused by progressive chip aging. Besides fault detection, the FTOC paradigm provides attractive benefits, such as facilitating graceful performance degradation, mitigating the impact of verification blind spots, and improving chip yield.

Keywords: fault tolerance; on-chip; self-test; self-diagnosis; self-repair

Classification: N [General Natural Sciences]

 
