Software approaches for resilience of high performance computing systems: a survey  Cited by: 1

Authors: Jie JIA, Yi LIU, Guozhen ZHANG, Yulin GAO, Depei QIAN

Affiliations: [1] School of Computer Science and Engineering, Beihang University, Beijing 100191, China; [2] Sino-German Joint Software Institute, Beihang University, Beijing 100191, China

Source: Frontiers of Computer Science (中国计算机科学前沿, English edition), 2023, Issue 4, pp. 43-56, 14 pages

Funding: Supported by the GHFund A (No. ghfund202107010337).

Abstract: As high-performance computing (HPC) systems have scaled up in recent years, their reliability has declined steadily. System resilience is therefore regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. First, a classification of software resilience approaches is presented; then the major approaches and techniques are introduced, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. In addition, the challenges posed by system scale and heterogeneous architectures are discussed.

Keywords: resilience; high-performance computing; fault tolerance; challenge

Classification code: TP311.1 [Automation and Computer Technology: Computer Software and Theory]

 
