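Authors: Wang Panfeng[1], Du Yunfei[1], Fu Hongyi[1], Yang Xuejun[1], Zhou Haifang[1]
Affiliation: [1] National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha 410073
Source: Computer Science (《计算机科学》), 2009, No. 3, pp. 21-25 (5 pages)
Funding: Supported by the National Natural Science Foundation of China (grants 60621003 and 60603081)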
Abstract: Checkpointing is the most commonly used fault-tolerance technique in high-performance computing, but its performance degrades rapidly as the number of processors grows. This paper proposes a new approach, parallel recomputing, for tolerating a single process failure in parallel computing. Its main feature is that it achieves low-overhead fault tolerance by using the computing power of redundant processors rather than the storage capacity of redundant disks. The paper also presents an optimization that combines parallel recomputing with checkpointing to further reduce fault-tolerance overhead, and illustrates by example how to develop a parallel program based on parallel recomputing and its optimization. Finally, the approach is evaluated experimentally. The results show that when the number of processors becomes large, the overhead of parallel recomputing is lower than that of checkpointing, and that the optimized variant performs better than plain parallel recomputing.
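Illustration: the abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of recovering a lost result by recomputation instead of reading it back from storage. It assumes the failed process's share of the work can be split into independent pieces; the names `work`, `split`, and `run_with_recomputing` are hypothetical, and threads stand in for processors. It is not the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def work(chunk):
    """Hypothetical stand-in for one process's share of the computation."""
    return sum(x * x for x in chunk)


def split(chunk, parts):
    """Split a chunk into `parts` contiguous slices of nearly equal size."""
    size, extra = divmod(len(chunk), parts)
    slices, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        slices.append(chunk[start:end])
        start = end
    return slices


def run_with_recomputing(chunks, failed_rank):
    """Compute all chunks; the failed rank's chunk is recomputed in parallel.

    Assumption: exactly one rank fails and at least one rank survives.
    """
    n = len(chunks)
    results = {}
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Normal phase: every rank except the failed one delivers its result.
        futures = {r: pool.submit(work, chunks[r])
                   for r in range(n) if r != failed_rank}
        for rank, future in futures.items():
            results[rank] = future.result()
        # Recovery phase: no checkpoint is read from disk. The lost chunk is
        # split across the surviving workers and recomputed concurrently.
        pieces = split(chunks[failed_rank], n - 1)
        results[failed_rank] = sum(pool.map(work, pieces))
    return sum(results[r] for r in range(n))
```

In this toy setting the recovered total equals the failure-free total, because the lost partial result is rebuilt from its input rather than restored from a saved copy.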
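Classification: TP301.6 [Automation and Computer Technology: Computer Architecture]; U463.212 [Automation and Computer Technology: Computer Science and Technology]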