异构分布式系统DAG可靠性模型与容错算法  被引量:8

DAG Reliability Model and Fault-Tolerant Algorithm for Heterogeneous Distributed Systems

在线阅读下载全文

作  者:谢国琪[1] 李仁发[1] 刘琳[1] 杨帆[1] 

机构地区:[1]湖南大学嵌入式与网络计算湖南省重点实验室国家超级计算长沙中心,长沙410082

出  处:《计算机学报》2013年第10期2019-2032,共14页Chinese Journal of Computers

基  金:国家自然科学基金重点项目(61133005);国家自然科学基金面上项目(61173036;61070057;61272061);国家自然科学基金青年科学基金项目(61202102);国家"八六三"高技术研究发展计划项目基金(2012AA01A301-01)资助~~

摘  要:异构分布式系统性能得到大幅度提升的同时,却造成故障率大增,以有向无环图(Directed Acyclic Graph,DAG)任务模型研究异构分布式系统的容错调度成为当前的研究热点.广泛采用的基于任务复制的容错算法存在以下问题:(1)DAG任务可靠性需求与DAG可靠性需求的约束存在缺陷且缺乏严谨的理论证明;(2)每个任务仅有一个副版任务,不足以应对任务潜在的多次发生的故障;(3)盲目地使每个任务拥有ε+1个副版来容忍可能的ε个故障,虽然提高了系统的可靠性但易造成系统冗余度过高,并付出昂贵的计算资源.文中首先分析DAG图中任务依赖关系,确定DAG任务的可靠性概率模型,并建立DAG可靠性模型;接着提出满足可靠性目标的任务复制下限值算法、经济的任务复制策略算法和贪婪的任务复制策略算法,精确量化各个任务需要复制的次数,最后在上述算法的基础上提出可选策略的DAG容错算法OPDFT(Optional Policy on DAG Fault-Tolerant).实验表明,OPDFT算法的经济复制策略和贪婪复制策略的可靠性代价分别是盲目策略算法可靠性代价的60%和70%左右.The performance of heterogeneous distributed systems has been improved significant- ly, but caused increased failures dramatically. Tolerant scheduling in heterogeneous distributed systems with DAG (Directed Acyclic Graph) task model becomes a research focus. Widely used fault-tolerant algorithms based on task replication have following problems: (1) there are some deficiencies and lack of rigorous proof on constraint between DAG task reliability requirement and DAG reliability requirement; (2) only one backup copy of each task, which not enough to cope with potential repeated failures; (3) blindly to tolerate s faults of each task with e~ 1 backup copies, which improved the reliability of system, but caused high redundancy and resources con- sumption. Firstly, task dependencies of DAG are analyzed, then the DAG task reliability proba- bility model is determined and based on this, the DAG reliability model is constructed. Secondly, lower limit of task duplication algorithm, economic task duplication strategy algorithm and greedy algorithm for task replication strategy are presented to meet the reliability target of DAG and achieve precise quantification for each task's replicas. Finally, the OPDFT (Optional Policy on DAG Fault-Tolerant) algorithm is proposed based on above 3 algorithms. Experiments show that the reliability cost of economic policy and greed policy of OPDFT algorithm is about 60% and 70%o of blind strategy respectively.

关 键 词:异构分布式系统 可靠性 容错 有向无环图 任务复制 

分 类 号:TP302[自动化与计算机技术—计算机系统结构]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象