Author: Liu Peng [1]
Affiliation: [1] Department of Economic Information Management, Shanghai University of Finance and Economics, Shanghai 200433
Source: Computer Engineering and Applications, 2005, No. 33, pp. 172-175 (4 pages)
Abstract: This paper proposes a robust and effective improved decision tree model, R-C4.5, together with a simplified version. The model is based on the well-known C4.5 decision tree but changes its attribute-selection and branching strategy. For each candidate attribute, the entropy of every sample subset and the average of those subset entropies are computed; the subsets whose entropy is not less than the average are merged into a temporary composite subset, i.e. the branches that separate the classes poorly are united. A modified information gain for the node is then computed from the entropy of the temporary composite subset and the entropies of the unmerged subsets, and the attribute with the highest modified information gain is chosen as the test attribute of the current node; its branches correspond to the unmerged subsets and the composite subset. The simplified version of the model is carried out in the data preprocessing stage. R-C4.5 markedly improves the interpretability of the attribute-selection measure, reduces empty and insignificant branches, and mitigates overfitting.
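The attribute-selection step summarized in the abstract can be illustrated with a short sketch. The following Python code is only a minimal interpretation of that description, assuming discrete-valued attributes and a simple list-of-dicts dataset; the function and variable names (modified_gain, choose_attribute, etc.) are illustrative and do not come from the paper.

```python
# Sketch of the R-C4.5-style attribute-selection step described in the abstract.
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def modified_gain(samples, attribute, target):
    """Modified information gain of one attribute.

    Subsets whose entropy is not less than the average subset entropy are
    merged into one temporary composite subset before the gain is computed.
    """
    # Partition the samples by the attribute's values.
    subsets = {}
    for s in samples:
        subsets.setdefault(s[attribute], []).append(s[target])

    entropies = {v: entropy(labels) for v, labels in subsets.items()}
    avg = sum(entropies.values()) / len(entropies)

    # Merge the poorly separating branches (entropy >= average).
    merged = [lab for v, labels in subsets.items()
              if entropies[v] >= avg for lab in labels]
    kept = [labels for v, labels in subsets.items() if entropies[v] < avg]

    # Weighted entropy of the split: composite subset plus unmerged subsets.
    n = len(samples)
    split_entropy = (len(merged) / n) * entropy(merged)
    split_entropy += sum((len(labels) / n) * entropy(labels) for labels in kept)

    return entropy([s[target] for s in samples]) - split_entropy


def choose_attribute(samples, attributes, target):
    """Pick the attribute with the highest modified information gain."""
    return max(attributes, key=lambda a: modified_gain(samples, a, target))
```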
Keywords: decision tree model, C4.5, R-C4.5, classifier, data mining
Classification: TP311 [Automation and Computer Technology - Computer Software and Theory]