Source: Journal of Computer Research and Development (《计算机研究与发展》), 2010, No. 4, pp. 589-594 (6 pages)
Funding: Supported by the National Defense Pre-research Project Foundation under grant No. 513160401
Abstract: As the performance of high performance computers (HPC) keeps rising and their hardware scale keeps growing, achieving highly reliable system operation has become a major challenge in tera-scale and peta-scale HPC research and development. Starting from the reliability requirements of HPC, the authors survey the reliability technologies currently used in HPC hardware design: fault avoidance, static redundancy, dynamic redundancy, and online replacement. Static redundancy covers fault-masking techniques such as component redundancy, data path redundancy, and information redundancy; dynamic redundancy covers fault detection and diagnosis, reconfiguration, and recovery. Combined with online replacement, redundancy can greatly improve system RAS (reliability, availability, serviceability). The paper analyzes in detail how these technologies are applied in typical IBM, HP, and Cray systems. Finally, it discusses future trends in reliability technology for peta-scale HPC, suggesting that development work should focus on the reliability design of multi-core processors and on all-round memory protection, and pointing out that blade architecture facilitates modular redundancy and online replacement of components.
Keywords: high performance computer; reliability; fault avoidance; fault tolerance; redundancy; online replacement
Classification: TP302.7 [Automation and Computer Technology: Computer System Architecture]