A Large-Scale Study of Failures on Petascale Supercomputers  被引量:2

A Large-Scale Study of Failures on Petascale Supercomputers

在线阅读下载全文

作  者:Rui-Tao Liu Zuo-Ning Chen 

机构地区:[1]State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 213215, China [2]National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China

出  处:《Journal of Computer Science & Technology》2018年第1期24-41,共18页计算机科学技术学报(英文版)

摘  要:With the rapid development of supercomputers, the scale and complexity are ever increasing, and the reliability and resilience are faced with larger challenges. There are many important technologies in fault tolerance, such as proacrive failure avoidance technologies based on fault prediction, reactive fault tolerance based on checkpoint, and scheduling technologies to improve reliability. Both qualitative and quantitative descriptions on characteristics of system faults are very critical for these technologies, This study analyzes the source of failures on two typical petascale supercomputers called Sunway BlueLight (based on multi-core CPUs) and Sunway TaihuLight (based on heterogeneous manycore CPUs). It uncovers some interesting fault characteristics and finds unknown correlation relationship among main components' faults. Finally the paper analyzes the failure time of the two supercomputers in various grains of resource and different time spans, and builds a uniform multi-dimensional failure time model for petascale supereomputers.With the rapid development of supercomputers, the scale and complexity are ever increasing, and the reliability and resilience are faced with larger challenges. There are many important technologies in fault tolerance, such as proacrive failure avoidance technologies based on fault prediction, reactive fault tolerance based on checkpoint, and scheduling technologies to improve reliability. Both qualitative and quantitative descriptions on characteristics of system faults are very critical for these technologies, This study analyzes the source of failures on two typical petascale supercomputers called Sunway BlueLight (based on multi-core CPUs) and Sunway TaihuLight (based on heterogeneous manycore CPUs). It uncovers some interesting fault characteristics and finds unknown correlation relationship among main components' faults. Finally the paper analyzes the failure time of the two supercomputers in various grains of resource and different time spans, and builds a uniform multi-dimensional failure time model for petascale supereomputers.

关 键 词:petascale supercomputer fault characteristic correlation relationship MULTI-DIMENSION failure time model 

分 类 号:TP338.4[自动化与计算机技术—计算机系统结构] TP332[自动化与计算机技术—计算机科学与技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象