Author: Wu Enci (吴恩慈), Shanghai Qiyu Information Technology Co., Ltd., Shanghai 200120, China
Source: Computer Applications and Software (《计算机应用与软件》), 2021, No. 4, pp. 24-31, 123 (9 pages)
Abstract: Spark SQL supports high-speed computation and complex data mining, but it has usability problems on very large clusters and datasets: the choice of the optimal Catalyst execution plan and the Shuffle Partition configuration both strongly affect performance, and data skew often degrades cluster performance. To predict execution time accurately before a job runs, make fuller use of runtime data, and select the best execution plan, this paper proposes predicting job execution time with regression models based on decision trees and their ensemble algorithms. Cross-validation is used to tune the model hyperparameters, pruning and ensemble methods are used to reduce overfitting, and relevant metrics are selected to evaluate the prediction accuracy of the machine learning models. Experiments show that the gradient boosted tree regression model predicts job execution time with an R² above 0.8 and meets the real-time requirements of online prediction. The evaluation metrics reach the expected level and show some advantage over those of a linear regression model.
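Keywords: task scheduling; computing engine; job features; execution time; prediction model; decision tree
Classification number: TP311 [Automation and Computer Technology - Computer Software and Theory]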
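
The approach summarized in the abstract (gradient boosted regression trees, cross-validated hyperparameters, R² as the evaluation metric) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two job features (`size_gb`, `partitions`), the synthetic timing formula, the use of depth-1 trees as base learners, and the candidate round counts. It uses only the Python standard library rather than Spark MLlib.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_stump(X, residuals):
    """Find the depth-1 regression tree with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # midpoint between neighbouring feature values
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return best[1:]

def fit_gbt(X, y, n_rounds, lr=0.2):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]  # negative gradient of squared loss
        j, thr, lm, rm = fit_stump(X, residuals)
        stumps.append((j, thr, lm, rm))
        pred = [p + lr * (lm if row[j] <= thr else rm) for row, p in zip(X, pred)]
    return base, lr, stumps

def predict(model, X):
    base, lr, stumps = model
    return [base + lr * sum(lm if row[j] <= thr else rm for j, thr, lm, rm in stumps)
            for row in X]

def cross_val_r2(X, y, n_rounds, k=3):
    """Mean R^2 over k folds (fold membership is index modulo k)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        test = [i for i in range(len(y)) if i % k == fold]
        model = fit_gbt([X[i] for i in train], [y[i] for i in train], n_rounds)
        scores.append(r2_score([y[i] for i in test], predict(model, [X[i] for i in test])))
    return sum(scores) / k

# Synthetic jobs (hypothetical features and timing formula, for illustration only).
rng = random.Random(0)

def make_job():
    size_gb = rng.randint(1, 30)
    partitions = rng.choice([50, 100, 200, 400])
    seconds = 10 + 3 * size_gb + (40 if partitions < 100 else 0) + rng.gauss(0, 2)
    return [size_gb, partitions], seconds

jobs = [make_job() for _ in range(160)]
X = [features for features, _ in jobs]
y = [seconds for _, seconds in jobs]
X_train, y_train, X_test, y_test = X[:120], y[:120], X[120:], y[120:]

# Choose the number of boosting rounds by cross-validation, then refit and evaluate.
best_rounds = max((20, 60), key=lambda n: cross_val_r2(X_train, y_train, n))
model = fit_gbt(X_train, y_train, best_rounds)
test_r2 = r2_score(y_test, predict(model, X_test))
print(f"rounds={best_rounds}, test R^2={test_r2:.3f}")
```

In a real Spark deployment the features would come from job and runtime statistics, and a library implementation of gradient boosted trees would replace the hand-written stumps; the sketch only shows how the pieces named in the abstract fit together.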