Affiliation: [1] School of Computer Science, National University of Defense Technology, Changsha 410073
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2005, No. 6, pp. 987-992 (6 pages)
Funding: National High Technology Research and Development Program of China ("863" Program), grant 2002AA1Z2101
Abstract: Distributed checkpointing is an important means of fault tolerance for large-scale parallel computing systems. As high-performance clusters continue to grow in size and popularity, fault tolerance and reliability are becoming limiting factors for parallel computing. Two bottlenecks, checkpointing protocol overhead and the storage cost of checkpoint images, limit the scalability of parallel checkpoint systems, and scalability is critical on large-scale clusters. To address these issues, this paper presents the design of the C system, which provides coordinated checkpointing for MPI-based parallel applications using dynamic virtual connections and distributed storage of checkpoint images. The design exploits the execution characteristics of parallel applications and the capacity of the cluster nodes' local disks to reduce the checkpointing cost of large-scale parallel jobs and improve the scalability of the checkpoint system. Initial experimental results on a cluster testbed show that the mechanism has negligible performance impact and that the design of the C system is suitable for fault tolerance in large-scale parallel computing.
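The record gives no implementation details, so the following is only a minimal sketch of the general idea of coordinated checkpointing with checkpoint images kept on each node's local disk. It is not the C system's protocol and does not model dynamic virtual connections. Threads stand in for MPI ranks, per-rank directories stand in for node-local disks, and all names (`run_job`, `restore`, `ckpt.json`) are hypothetical.

```python
import json
import tempfile
import threading
from pathlib import Path


def run_job(num_ranks, steps, checkpoint_at, root):
    """Run a toy parallel job in which every rank checkpoints at one step."""
    barrier = threading.Barrier(num_ranks)
    final = [0] * num_ranks

    def rank_main(rank):
        state = {"rank": rank, "step": 0, "acc": 0}
        # Hypothetical stand-in for a node-local disk: one directory per rank.
        local_disk = Path(root) / f"node{rank}"
        local_disk.mkdir(parents=True, exist_ok=True)
        for step in range(1, steps + 1):
            state["acc"] += rank * step  # toy "computation"
            state["step"] = step
            if step == checkpoint_at:
                # Phase 1: all ranks stop at the same logical point.
                barrier.wait()
                tmp = local_disk / "ckpt.tmp"
                tmp.write_text(json.dumps(state))
                # Phase 2: wait until every rank has written its image,
                # then commit it by renaming the temporary file.
                barrier.wait()
                tmp.replace(local_disk / "ckpt.json")
        final[rank] = state["acc"]

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return final


def restore(num_ranks, root):
    """Read each rank's committed image back from its own directory."""
    return [json.loads((Path(root) / f"node{r}" / "ckpt.json").read_text())
            for r in range(num_ranks)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(run_job(4, 5, 3, root))
        print(restore(4, root))
```

Because each rank writes only to its own directory, no single storage server has to absorb every image at once, which is the storage bottleneck the abstract describes. The two barriers keep the set of images mutually consistent, at the cost of synchronizing all ranks.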
Classification: TP316.4 [Automation and Computer Technology: Computer Software and Theory]