检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
机构地区:[1]国防科学技术大学计算机学院,湖南长沙410073
出 处:《软件学报》2012年第4期1022-1035,共14页Journal of Software
基 金:国家自然科学基金(61003081;61003087;60921062)
摘 要:随着系统规模的扩大,并行计算的性能不断提高,但可靠性却也在不断下降,因此需要采用某种容错机制来容忍或恢复硬件故障和数据错误.目前常用的容错机制Checkpoint/Restart和多模冗余均引入了额外的开销,这些开销均在某种程度上制约了并行计算的可扩展性.因此,在高性能计算需求不断增长的今天,可扩展容错机制的设计显得尤为迫切和重要.以三模冗余(triple modular redundancy,简称TMR)为典型案例,描述了传统TMR在大规模MPI并行计算上的实现方法,分析了该机制所面临的实际问题,进而指出传统TMR制约了并行计算的扩展.根据该技术所面临的问题,设计了可扩展三模冗余(scalable triple modular redundancy,简称STMR),并进一步验证了其有效性和可扩展性.该机制不仅能够处理Checkpoint/Restart针对的fail-stop故障,还能够解决绝大部分硬件不能直接感知的数据错误.最后,借用BlueGene/L的系统参数进行模拟,预测当系统规模增大时,在分别采用TMR和STMR的情况下并行计算可扩展性的变化,结果进一步验证了STMR是可扩展的容错机制.The scale-up of system brings improvement in performance as well as reliability degradation, so there is a need to apply some fault tolerance mechanism to tolerate hardware failure or recover data. Currently, the popular fault tolerance mechanisms, such as Checkpoint/Restart and N-modular redundancy, all need additional overhead, which limits the scalability of parallel computing to some extent. Therefore, it is very important to develop scalable fault tolerance mechanisms for increasingly high performance supercomputing. This paper takes triple modular redundancy (TMR) as an example, describes the implementation of TMR on large-scale MPI parallel computing, and argues that traditional TMR fault-tolerant mechanism limits the scalability of parallel computing. To solve these practical problems, the paper proposes the scalable triple modular redundancy (STMR), and verifies the validity and scalability of it. STMR can not only handle the fail-stop failures that are traditionally handled by Checkpoint/Restart, but can also deal with most of data errors not perceived directly by the hardware. Finally, the study conducts the simulation using the system parameters of BlueGene/L, which shows the scalability change of parallel computing with the TMR and the STMR respectively when the system size increases. The results further validate STMR position as scalable fault-tolerant mechanism.
关 键 词:容错机制 可扩展性 三模冗余 大规模并行计算 MPI
分 类 号:TP302[自动化与计算机技术—计算机系统结构]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.249