云环境下基于统计监测的分布式软件系统故障检测技术研究  Cited by: 32

A Survey of Fault Detection for Distributed Software Systems with Statistical Monitoring in Cloud Computing


Authors: 王焘[1], 张文博[1], 徐继伟[1], 魏峻[1], 钟华[1]

Affiliation: [1] Institute of Software, Chinese Academy of Sciences, Beijing 100190

Source: 《计算机学报》 (Chinese Journal of Computers), 2017, No. 2, pp. 397-413 (17 pages)

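Funding: National Natural Science Foundation of China (61402450); Beijing Natural Science Foundation (4154088); National Key Technology R&D Program of China (2015BAH55F02-4); National Basic Research Program of China (973 Program) (2015CB352201); National High Technology Research and Development Program of China (863 Program) (2013AA041301)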

Abstract: A growing number of distributed software systems are deployed on public cloud computing platforms and provide services over the Internet. The complexity, dynamism, and openness of cloud environments make these systems more prone to faults, which cause service failures that affect large numbers of users and can result in serious economic loss. Fault detection aims to detect faults automatically and promptly so that failures and their losses can be avoided or reduced, and it is one of the key technologies for guaranteeing the performance and reliability of distributed software systems. Cloud computing poses new challenges for fault detection, which this paper analyzes first. Fault detection based on statistical monitoring collects monitoring data online to build statistical models and uses those models to analyze and predict the system's running state. Because it offers real-time monitoring and analysis and automated detection, and requires no domain knowledge, it suits cloud environments and has attracted wide attention from academia and industry. The paper proposes a reference framework for fault management with statistical monitoring for distributed software systems in cloud environments, comprising distributed monitoring, monitoring data processing, fault detection, fault diagnosis, and fault handling modules. It divides existing work into four categories (rule-based, metric analysis, log analysis, and behavior analysis), introduces the principles behind each, and compares their strengths and weaknesses. Finally, in view of the characteristics of current cloud environments, it outlines future research directions in three areas: online automatic detection, runtime environment awareness, and component interaction analysis.
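
As a rough illustration of the general idea described in the abstract (building a statistical model from monitoring data collected online and using it to judge new observations), the following is a minimal Python sketch. It is not taken from the paper: the class name `OnlineZScoreDetector`, the z-score rule, the threshold of 3.0, and the warm-up length are all assumptions chosen for illustration, and a surveyed method may differ substantially.

```python
import math


class OnlineZScoreDetector:
    """Hypothetical metric-based detector: running mean/variance plus a z-score rule.

    The model is updated incrementally (Welford's algorithm), so no history
    of past samples needs to be stored.
    """

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # assumed cutoff, in standard deviations
        self.warmup = warmup        # samples to absorb before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def observe(self, value):
        """Return True if `value` looks anomalous under the current model."""
        anomalous = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(value - self.mean) / std > self.threshold:
                anomalous = True
        if not anomalous:
            # Only samples judged normal update the model, so a fault
            # does not drag the baseline toward itself.
            self.n += 1
            delta = value - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (value - self.mean)
        return anomalous


def detect(samples, threshold=3.0, warmup=10):
    """Return the indices of samples flagged as anomalous."""
    detector = OnlineZScoreDetector(threshold, warmup)
    return [i for i, v in enumerate(samples) if detector.observe(v)]
```

For example, feeding a series of response times in milliseconds through `detect` flags only the values that deviate strongly from the baseline learned so far. A single-metric rule like this ignores the workload changes and component interactions that the abstract lists as open problems.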

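Keywords: cloud computing; software monitoring; distributed software systems; software fault detection; performance anomaly detection; statistical monitoring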

Classification number: TP311 [Automation and Computer Technology - Computer Software and Theory]

 
