An Improved C4.5 Algorithm Based on Local Weighting of the Training Set


Author: ZHANG Yang-wu (张扬武) (Department of Science and Technology Teaching, China University of Political Science and Law, Beijing 102249, China)

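Affiliation: Department of Science and Technology Teaching, China University of Political Science and Law, Beijing 102249, China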

Source: Computer Knowledge and Technology (《电脑知识与技术》), 2016, No. 6, pp. 202-204 (3 pages)

Abstract: The C4.5 algorithm builds decision trees using the information gain ratio, which overcomes the bias toward attributes with many values, and it can handle continuous attributes. On large data sets, however, it is inefficient, and it ignores the differences in distance between individual training samples and the test data. This paper proposes an improved C4.5 algorithm based on local weighting of the training set. Sample weights are defined by Euclidean distance or Hamming distance and written back into the training set, so that the recomputed information gain ratio reflects how differences among training samples affect the test data. When processing large data sets, the training set is simplified by sorting samples by weight and applying a preset threshold, which reduces computational complexity and improves efficiency.
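The steps named in the abstract (distance-based sample weights, a weighted information gain ratio, and threshold-based simplification of the training set) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the abstract does not give the weighting formula, so the inverse-distance weight `1 / (1 + d)` is an assumption, and all function names (`local_weights`, `weighted_gain_ratio`, `prune_by_weight`) are hypothetical.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Hamming distance: number of positions where the values differ."""
    return sum(x != y for x, y in zip(a, b))


def local_weights(samples, query, distance):
    """Weight each training sample by closeness to the query point.

    ASSUMPTION: w = 1 / (1 + d). The abstract only says weights are
    defined from the distance; the exact formula is not stated.
    """
    return [1.0 / (1.0 + distance(s, query)) for s in samples]


def weighted_entropy(labels, weights):
    """Entropy of the class labels, using weight mass instead of counts."""
    total = sum(weights)
    mass = defaultdict(float)
    for y, w in zip(labels, weights):
        mass[y] += w
    return -sum((m / total) * math.log2(m / total)
                for m in mass.values() if m > 0)


def weighted_gain_ratio(values, labels, weights):
    """Gain ratio of one categorical attribute under sample weights."""
    total = sum(weights)
    groups = defaultdict(lambda: ([], []))
    for v, y, w in zip(values, labels, weights):
        groups[v][0].append(y)
        groups[v][1].append(w)
    conditional = 0.0
    split_info = 0.0
    for ys, ws in groups.values():
        p = sum(ws) / total
        conditional += p * weighted_entropy(ys, ws)
        split_info -= p * math.log2(p)
    gain = weighted_entropy(labels, weights) - conditional
    return gain / split_info if split_info > 0 else 0.0


def prune_by_weight(samples, labels, weights, threshold):
    """Sort by descending weight and keep samples with weight >= threshold."""
    ranked = sorted(zip(samples, labels, weights), key=lambda t: -t[2])
    return [(s, y, w) for s, y, w in ranked if w >= threshold]
```

In such a sketch, a tree builder would call `local_weights` once per test instance, optionally shrink the training set with `prune_by_weight`, and then choose split attributes by `weighted_gain_ratio` in place of the unweighted gain ratio.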

Keywords: C4.5; information gain ratio; local weighting; data set; neighbor distance

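Classification number: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]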

 
