The Design and Implementation of a Monitoring System for Super-Large Computing Cluster  (cited by: 1)

Authors: PENG Liang (彭亮), NIU Tie (牛铁)[1], WEI Baoliang (魏宝亮), ZHAO Yi (赵毅)[1] (Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China)

Affiliation: [1] Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China

Source: Frontiers of Data & Computing (《数据与计算发展前沿》), 2023, No. 1, pp. 97-103 (7 pages)

Funding: Strategic Priority Research Program of the Chinese Academy of Sciences (Category A) (XDA19020101).

Abstract: [Background] Traditional cluster monitoring software cannot meet the monitoring and management requirements of super-large-scale computing clusters with more than 10,000 nodes, or of multi-cluster systems, in terms of performance, flexibility, and scalability. [Objective] A new cluster monitoring system is urgently needed to improve the operational management capability and efficiency of super-large-scale clusters and multi-cluster systems. [Methods] This paper adopts a central-plus-branch (distributed) monitoring architecture and uses message-oriented middleware, distributed storage, and REST technology to implement a monitoring system for super-large-scale computing clusters. [Results] The system supports user-defined monitoring metrics, real-time active data push, and automatic alerting, and it scales out well horizontally. It has been deployed on several computing clusters, meets the monitoring needs of more than 10,000 nodes and devices, and collects more than 200 GB of data per day on average. [Limitations] Because the monitoring metrics are numerous and the volume of monitoring data is very large, the ability to perform correlation analysis of the data for specific business scenarios still needs improvement. [Conclusions] The work presented in this paper meets the need for automated operation and management of super-large-scale computing clusters and geographically distributed multi-cluster systems. The approach can serve as a reference for developing operation and management tools for even larger clusters and for exascale computing systems.

Keywords: super-large scale; computing; cluster; HPC; monitoring

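Classification: TP277 [Automation and Computer Technology: Detection Technology and Automatic Devices]; TP38 [Automation and Computer Technology: Control Science and Engineering]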

 
