Authors: 程稳 (CHENG Wen); 李焱 (LI Yan); 曾令仿 (ZENG Ling-fang); 王芳 (WANG Fang); 唐士程 (TANG Shi-cheng); 杨力平 (YANG Li-ping); 冯丹 (FENG Dan); 曾文君 (ZENG Wen-jun)
Affiliations: [1] Wuhan National Laboratory for Optoelectronics, Key Laboratory of Information Storage System and Engineering Research Center of Data Storage Systems and Technology, Ministry of Education of China, Huazhong University of Science and Technology, Wuhan 430074, China [2] China National GeneBank, BGI-Shenzhen, Shenzhen, Guangdong 518120, China [3] Zhejiang Lab, Hangzhou 311121, China
Source: 《计算机科学》 (Computer Science), 2022, No. 10, pp. 1-9 (9 pages)
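Funding: Key Program of the National Natural Science Foundation of China (61832020); Innovative Research Group Project of the National Natural Science Foundation of China (61821003); Zhejiang Lab self-initiated research project (2021DA0AM01).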
Abstract: Error logs from cluster storage systems can help improve the availability and stability of those systems. Existing studies of storage system errors mainly analyze single-node (local) storage systems or only part of a cluster storage system's functionality; long-term, multi-perspective studies carried out within a single production environment under real workloads are lacking. As new functional modules are continually integrated, cluster storage systems grow increasingly complex, and errors caused by the cluster storage system itself keep emerging, which creates difficulties for researchers and developers. To address these problems, we conduct a comprehensive study of Lustre error logs and propose system optimization strategies for Lustre cluster storage. By collecting error logs over 1,673 consecutive days, we study nearly 2.26 GB of Lustre error logs, analyze the characteristics and problems of errors across multiple Lustre versions, reveal shortcomings and errors in various parts of the cluster storage system, and study the factors that influence errors in different Lustre versions. We also summarize the common errors of Lustre clusters in a real production environment and give corresponding solutions. We derive numerous new insights into Lustre system development and report 14 findings. Finally, we collect newly added error logs over 333 consecutive days to verify the 14 findings and present several examples of error optimization. Tests show that these optimizations can significantly reduce the number of errors and improve the availability and stability of the system. The results and suggestions should be useful both for the development of cluster storage systems themselves and for their operation and maintenance.
Keywords: Lustre file system; log analysis; system optimization; errors; reliability
Classification: TP399 [Automation and Computer Technology: Computer Application Technology]