检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:张元鸣[1] 陈苗[1] 陆佳炜[1] 徐俊[1] 肖刚[1]
机构地区:[1]浙江工业大学计算机科学与技术学院,浙江杭州310023
出 处:《计算机工程与科学》2017年第5期841-848,共8页Computer Engineering & Science
基 金:浙江省重大科技专项(2014C01408);浙江省公益性技术项目(2017C31014)
摘 要:针对经典C4.5决策树算法存在过度拟合和伸缩性差的问题,提出了一种基于Bagging的决策树改进算法,并基于MapReduce模型对改进算法进行了并行化。首先,基于Bagging技术对C4.5算法进行了改进,通过有放回采样得到多个与初始训练集大小相等的新训练集,并在每个训练集上进行训练,得到多个分类器,再根据多数投票规则集成训练结果得到最终的分类器;然后,基于MapReduce模型对改进算法进行了并行化,能够并行化处理训练集、并行选择最佳分割属性和最佳分割点,以及并行生成子节点,实现了基于MapReduce Job工作流的并行决策树改进算法,提高了对大数据集的分析能力。实验结果表明,并行Bagging决策树改进算法具有较高的准确度与敏感度,以及较好的伸缩性和加速比。In order to address the shortcomings of overfitting and poor scalability of the C4.5 decision tree algorithm, we propose an optimized C4.5 algorithm with Bagging technique, and then parallelize it according to the MapReduce model. The optimized algorithm can obtain multiple new training sets that are equal to the initial training set by sampling with replacement. Multiple classifiers can be obtained by training the algorithm with these new trainingsets. A final classifier is generated according to a majority voting rule that integrates the training results. Then, the optimized algorithm is parallelized in three as- pects, including parallel processing training sets, parallel selecting optimal decomposition attributes and optimal decomposition point, and parallel generating child nodes. A parallel algorithm based on job workflow is implemented to improve the ability of big data analysis. Experimental results show that the parallel and optimized decision tree algorithm has higher accuracy, higher sensitivity, better scalability and higher performance.
关 键 词:决策树 BAGGING MAPREDUCE模型 大数据分析 准确性
分 类 号:TP390.027[自动化与计算机技术—计算机应用技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:18.226.181.89