Authors: WANG Tao (王焘), GU Ze-Yu (顾泽宇)[1,2], ZHANG Wen-Bo (张文博), XU Ji-Wei (徐继伟)[1,2], WEI Jun (魏峻), ZHONG Hua (钟华)[1,2]
Affiliations: [1] State Key Laboratory of Computer Science, Beijing 100190; [2] Institute of Software, Chinese Academy of Sciences, Beijing 100190
Source: Chinese Journal of Computers (《计算机学报》), 2018, No. 6, pp. 1332-1345 (14 pages)
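Funding: National Natural Science Foundation of China (61402450); Beijing Natural Science Foundation (4154088); CCF-Venustech "Hongyan" Research Funding Program (CCF-VenustechRP2016007); National Key Technology R&D Program (2015BAH55F02); National High Technology Research and Development Program of China ("863" Program) (2013AA041301)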
Abstract: Monitoring is key to guaranteeing the performance and reliability of cloud computing systems. By analyzing monitoring data, administrators can understand a system's running status and detect, diagnose, and solve problems early. However, cloud computing systems are large and structurally complex, and collecting, transmitting, storing, and analyzing a large amount of monitoring data introduces enormous performance overhead. How to reduce monitoring overhead while improving the accuracy and timeliness of fault detection has therefore become an urgent problem. To address it, this paper proposes a fault detection approach for cloud computing systems based on adaptive monitoring. First, we conduct correlation analysis between metrics to construct an undirected correlation graph and monitor only the key metrics selected from the graph, which can represent the other metrics and reflect the running status of the whole system. Second, we apply Principal Component Analysis (PCA) to the monitoring data in a sliding window to obtain the principal eigenvector that characterizes the running status, and we estimate the abnormality degree from the cosine similarity between the current and the historical monitoring data. Finally, we build a reliability model to predict when the system may fail and dynamically adjust the monitoring period based on the estimated abnormality degree and this model. To evaluate the proposal, we applied the approach to a TPC-W benchmark deployed on our cloud computing platform. The experimental results demonstrate that the approach adapts to dynamic workload fluctuation in the cloud environment, accurately estimates the abnormality degree, and automatically adjusts the monitoring frequency. The approach thus improves the accuracy and timeliness of fault detection in the abnormal status and lowers the monitoring overhead in the normal status.
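The second and third steps of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sign-insensitive cosine comparison, and in particular the linear period rule (a stand-in for the paper's reliability model, which the abstract does not specify) are all assumptions made for illustration.

```python
import numpy as np


def principal_direction(window):
    """First principal component (unit vector) of a samples x metrics window."""
    window = np.asarray(window, dtype=float)
    centered = window - window.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def abnormality(baseline_dir, current_dir):
    """Abnormality degree in [0, 1] as 1 - |cosine similarity|.

    The sign of a principal direction is arbitrary, so the absolute
    value of the cosine is used (an assumption of this sketch).
    """
    cos = abs(float(np.dot(baseline_dir, current_dir)))
    return 1.0 - min(cos, 1.0)


def next_period(score, min_period=1.0, max_period=60.0):
    """Hypothetical rule: shorten the monitoring period as abnormality grows.

    This linear interpolation only stands in for the paper's
    reliability model; the period bounds are made-up defaults.
    """
    score = min(max(score, 0.0), 1.0)
    return max_period - score * (max_period - min_period)


# Two synthetic windows over three metrics: in the second one the
# relationship between the first two metrics is inverted.
x = np.linspace(0.0, 1.0, 50)
normal_window = np.column_stack([x, 2 * x, np.zeros_like(x)])
shifted_window = np.column_stack([x, -2 * x, np.zeros_like(x)])

baseline = principal_direction(normal_window)
score = abnormality(baseline, principal_direction(shifted_window))
print(score, next_period(score))
```

In this sketch a window whose metric relationships match the baseline yields a score near 0 and the longest period, while a window whose principal direction has rotated yields a higher score and more frequent monitoring.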
Keywords: fault detection; adaptive monitoring; cloud computing; correlation analysis; principal component analysis
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]