机构地区:[1]Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China [2]College of Computer, National University of Defense Technology, Changsha 410073, China [3]ATR Laboratory, National University of Defense Technology, Changsha 410073, China
出 处:《Frontiers of Computer Science》2014年第3期378-390,共13页中国计算机科学前沿(英文版)
基 金:Acknowledgements This work was partially supported by National High-tech R&D Program of China (863 Program) (2012AA01A301, 2012AA010901), by Program for New Century Excellent Talents in University and by National Natural Science Foundation of China (Grant Nos. 61272142, 61103082, 61170261, and 61103193).
摘 要:With the increase of system scale, the inherent reliability of supercomputers becomes lower and lower. The cost of fault handling and task recovery increases so rapidly that the reliability issue will soon harm the usability of supercomputers. This issue is referred to as the "reliability wall", which is regarded as a critical problem for current and future supercomputers. To address this problem, we propose an autonomous fault-tolerant system, named Iaso, in MilkyWay- 2 system. Iaso introduces the concept of autonomous management in supercomputers. By autonomous management, the computer itself, rather than manpower, takes charge of the fault management work. Iaso automatically manage the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the autonomous features with MilkyWay-2 system, such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers reduces from several hours to a few seconds. Iaso greatly improves the usability and reliability of MilkyWay-2 system.With the increase of system scale, the inherent reliability of supercomputers becomes lower and lower. The cost of fault handling and task recovery increases so rapidly that the reliability issue will soon harm the usability of supercomputers. This issue is referred to as the "reliability wall", which is regarded as a critical problem for current and future supercomputers. To address this problem, we propose an autonomous fault-tolerant system, named Iaso, in MilkyWay- 2 system. Iaso introduces the concept of autonomous management in supercomputers. By autonomous management, the computer itself, rather than manpower, takes charge of the fault management work. Iaso automatically manage the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the autonomous features with MilkyWay-2 system, such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers reduces from several hours to a few seconds. Iaso greatly improves the usability and reliability of MilkyWay-2 system.
关 键 词:SUPERCOMPUTER autonomous management fault tolerant fault management MilkyWay-2 system
分 类 号:TP338.4[自动化与计算机技术—计算机系统结构] TP311.13[自动化与计算机技术—计算机科学与技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...