A Distributed C4.5 Decision Tree Algorithm Based on MapReduce


Authors: PAN Junhui[1], WANG Hui[1], ZHANG Qiang[1], WANG Haochang[1]

Affiliation: [1] School of Computer & Information Technology, Northeast Petroleum University, Daqing 163318

Source: Computer & Digital Engineering (《计算机与数字工程》), 2025, No. 2, pp. 327-331 (5 pages)

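Funding: Supported by the Daqing Science and Technology Bureau 2023 Guiding Science and Technology Project (No. zd-2023-38) and the National Natural Science Foundation of China (No. 61702093).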

Abstract: The C4.5 decision tree is an effective algorithm for extracting classification rules. It performs well on small and medium-sized data sets, but applying it directly to large-scale data sets is limited in several respects, whereas the MapReduce framework makes distributed implementation of algorithms convenient. This paper therefore combines MapReduce with the C4.5 decision tree and proposes a MapReduce-based parallel C4.5 decision tree algorithm (MRCTA). While retaining the advantages of C4.5, the algorithm builds each tree node in two steps: it first uses MapReduce to evaluate the candidate splitting attributes in parallel, and then uses the resulting optimal splitting attribute to partition the data in a distributed manner, generating the child nodes. To avoid overfitting, the tree depth, the number of samples covered by a node, and the class proportions within a node serve as termination conditions during construction. Finally, experiments compare and analyze the effectiveness and efficiency of the algorithm.

Keywords: decision tree; distributed algorithm; parallel computing; MapReduce

Classification code: TP301.6 [Automation and Computer Technology - Computer System Architecture]
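
The node-construction step described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' MRCTA implementation: the map and reduce phases are simulated with plain Python functions rather than a Hadoop job, all function names are hypothetical, and the thresholds in `should_stop` (`max_depth`, `min_samples`, `purity`) are placeholder assumptions, since the abstract does not state the values used. The split criterion is the standard C4.5 gain ratio.

```python
from collections import Counter
from math import log2

def map_record(record, label):
    """Map phase (simulated): emit ((attribute, value, label), 1) per attribute."""
    for attr, value in record.items():
        yield (attr, value, label), 1

def reduce_counts(pairs):
    """Reduce phase (simulated): sum the counts emitted for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals

def entropy(counts):
    """Shannon entropy (base 2) of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain_ratio(totals, attr, class_counts):
    """Standard C4.5 gain ratio of one attribute, computed from reduced counts."""
    n = sum(class_counts.values())
    by_value = {}
    for (a, value, label), count in totals.items():
        if a == attr:
            by_value.setdefault(value, Counter())[label] += count
    conditional = sum(
        sum(c.values()) / n * entropy(c.values()) for c in by_value.values()
    )
    split_info = entropy(sum(c.values()) for c in by_value.values())
    gain = entropy(class_counts.values()) - conditional
    return gain / split_info if split_info > 0 else 0.0

def best_split(records, labels):
    """Pick the attribute with the highest gain ratio for the current node."""
    pairs = (kv for r, y in zip(records, labels) for kv in map_record(r, y))
    totals = reduce_counts(pairs)
    class_counts = Counter(labels)
    return max(records[0].keys(), key=lambda a: gain_ratio(totals, a, class_counts))

def should_stop(labels, depth, max_depth=10, min_samples=5, purity=0.95):
    """Termination test on depth, node size, and majority-class share.

    The default thresholds are illustrative assumptions, not values from the paper.
    """
    if depth >= max_depth or len(labels) <= min_samples:
        return True
    majority = Counter(labels).most_common(1)[0][1]
    return majority / len(labels) >= purity
```

In a real MapReduce deployment, `map_record` would run on each data partition and `reduce_counts` would aggregate per key across partitions; only the small table of counts needs to reach the driver that selects the split.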

 
