Method of Predicting Spark SQL Job Execution Time by Decision Tree Model

Author: Wu Enci (Shanghai Qiyu Information Technology Co., Ltd., Shanghai 200120, China)

Affiliation: [1] Shanghai Qiyu Information Technology Co., Ltd., Shanghai 200120, China

Source: Computer Applications and Software (《计算机应用与软件》), 2021, No. 4, pp. 24-31, 123 (9 pages)

Abstract: Spark SQL supports high-speed computing and complex data mining, but it suffers from ease-of-use problems on very large clusters and datasets: the choice of the optimal Catalyst execution plan and the Shuffle Partition configuration both have a large impact on performance, and data skew often degrades cluster performance. To predict execution time accurately before a job runs, make fuller use of runtime data, and select the best execution plan, this paper proposes a method that predicts job execution time with regression models based on decision trees and their ensemble algorithms. Cross-validation is used to tune the model hyperparameters, pruning and ensemble algorithms are used to mitigate overfitting, and relevant metrics are selected to evaluate the prediction accuracy of the machine learning models. Experiments show that the gradient boosted tree regression model predicts job execution time with an R² above 0.8 and meets the real-time requirements of online prediction. The evaluation metrics reach the expected level and show a certain advantage over those of a linear regression model.

Keywords: task scheduling; computing engine; job characteristics; execution time; prediction model; decision tree

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
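
The approach summarized in the abstract (a gradient boosted tree regressor scored by R²) can be illustrated with a small sketch. This record does not include the paper's features, data, or implementation, so everything below is an assumption made for illustration: the job features (`size_gb`, `joins`, `partitions`), the synthetic runtime formula in `make_jobs`, and the helper names `fit_stump`, `fit_gbt`, `predict_gbt`, and `r2_score` are all hypothetical. For brevity the trees are depth-1 stumps and the code uses only the Python standard library; it is not the author's method, and it omits the cross-validation and pruning steps.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_stump(X, residuals):
    """Fit a depth-1 regression tree: the split with the lowest squared error."""
    best = None  # (sse, feature index, threshold, left mean, right mean)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lv, rv)
    return best[1:]

def predict_stump(stump, row):
    j, thr, lv, rv = stump
    return lv if row[j] <= thr else rv

def fit_gbt(X, y, n_rounds=40, learning_rate=0.3):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps

def predict_gbt(model, row):
    base, lr, stumps = model
    return base + lr * sum(predict_stump(s, row) for s in stumps)

def make_jobs(n, seed):
    """Synthetic (hypothetical) jobs; the runtime formula is invented for the demo."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        size_gb = rng.uniform(1, 100)
        joins = rng.randint(0, 4)
        partitions = rng.choice([50, 100, 200, 400])
        seconds = 2.0 * size_gb + 15.0 * joins + 2000.0 / partitions + rng.gauss(0, 5)
        X.append([size_gb, joins, partitions])
        y.append(seconds)
    return X, y

# Train on one synthetic sample and score on a separate held-out sample.
X_train, y_train = make_jobs(120, seed=0)
X_test, y_test = make_jobs(60, seed=1)
model = fit_gbt(X_train, y_train)
test_preds = [predict_gbt(model, row) for row in X_test]
print("held-out R^2:", round(r2_score(y_test, test_preds), 3))
```

Scoring on a held-out sample rather than on the training data matters here, because boosting can always push the training R² upward by adding rounds. In practice a library implementation such as a Spark MLlib or scikit-learn gradient boosted regressor would replace the hand-written stumps.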
