面向Lustre集群存储的错误日志分析及系统优化  

Error Log Analysis and System Optimization for Lustre Cluster Storage

Authors: CHENG Wen (程稳); LI Yan (李焱); ZENG Ling-fang (曾令仿); WANG Fang (王芳); TANG Shi-cheng (唐士程); YANG Li-ping (杨力平); FENG Dan (冯丹); ZENG Wen-jun (曾文君)

Affiliations: [1] Wuhan National Laboratory for Optoelectronics, Key Laboratory of Information Storage System (Ministry of Education of China), Engineering Research Center of Data Storage Systems and Technology (Ministry of Education of China), Huazhong University of Science and Technology, Wuhan 430074, China; [2] China National GeneBank, BGI-Shenzhen, Shenzhen, Guangdong 518120, China; [3] Zhejiang Lab, Hangzhou 311121, China

Source: Computer Science (《计算机科学》), 2022, No. 10, pp. 1-9 (9 pages)

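Funding: Key Program of the National Natural Science Foundation of China (61832020); Innovative Research Group Project of the National Natural Science Foundation of China (61821003); Zhejiang Lab center-initiated research project (2021DA0AM01).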

Abstract: Error logs from cluster storage systems can help improve the availability and reliability of those systems. Previous studies of storage system errors focus on local file systems or on parts of a cluster storage system; long-term, multi-perspective studies of error logs from a single production environment under real workloads are lacking. As new functional modules are continually integrated, cluster storage systems grow increasingly complex, and errors caused by the storage system itself keep emerging, which creates difficulties for researchers and developers. To address these problems, we conduct a comprehensive study of Lustre system error logs. We collect error logs over 1,673 consecutive days, study nearly 2.26 GB of Lustre error logs, and analyze the characteristics and problems of errors across multiple Lustre versions. We show correlated errors between different subsystems and study the factors that may affect errors in different Lustre versions. We also summarize the common errors seen in Lustre in a production environment and give corresponding solutions. We derive numerous new insights into Lustre system development and report 14 findings. Finally, we collect new error logs over 333 consecutive days to verify the 14 findings and present several error optimization cases. Experimental results show that these cases can significantly reduce the number of errors and improve the availability and stability of the system. Our results and suggestions should be useful both for the development of cluster storage systems themselves and for Lustre operation and maintenance.

Keywords: Lustre file system; log analysis; system optimization; errors; reliability

Classification number: TP399 [Automation and Computer Technology - Computer Application Technology]

 
