基于双决策树的数据采样方法  被引量:9

A data sampling method based on double decision tree

在线阅读下载全文

作  者:陈力 费洪晓[2] 丁海伦 成琳 翟纪宇 CHEN Li;FEI Hong-xiao;DING Hai-lun;CHENG Lin;ZHAI Ji-yu(School of Geosciences and Info-Physics,Central South University,Changsha 410075;School of Software,Central South University,Changsha 410075,China)

机构地区:[1]中南大学地球科学与信息物理学院,湖南长沙410075 [2]中南大学软件学院,湖南长沙410075

出  处:《计算机工程与科学》2019年第1期130-135,共6页Computer Engineering & Science

基  金:国家自然科学基金(61602525);中南大学2017年本科生自由探索项目(201710533267;ZY20170769)

摘  要:在数据挖掘问题中,一个基本假设是训练集样本与测试集样本的数据分布一致,但随着数据量逐渐增加,如何在海量数据中找出具有代表意义的数据也变得尤为困难。对现有的数据选择方法研究发现,传统的简单随机抽样和渐进抽样等数据选择方法,由于没有和数据挖掘工具进行结合,采样结果具有偶然性和不确定性,抽样数据很难保证数据挖掘的基本假设,这也使得最终模型的泛化误差较大。为了解决数据采样过程中类间的不平衡问题,提出一种基于双决策树的结构化数据采样方法。首先通过C4.5算法生成一棵决策树,借助决策树在数据源中选择适合的数据和数据采集点,同时通过使用另一棵决策树对选择出的数据集的质量进行评估来达到高效率和高质量的数据采样。实验表明,与简单随机抽样相比,新采样数据下训练的模型准确率有明显提高。In data mining, a basic assumption is that the data distribution of training set samples are consistent with that of test set samples. But as data volumes increase, how to find out representative data in huge amounts of data becomes particularly difficult. By studying existing data selection methods, we find that it is difficult to evaluate their sampling effect because they are not integrated with the data mining tool, such as simple random sampling and progressive sampling. Due to contingency factors and uncertainty, it is difficult to guarantee the basic assumptions of data mining, which also makes the generalization error of the model larger. In order to solve these problems, we put forward a structured data sampling method based on double decision tree. Firstly, we generate a decision tree with the C4.5 algorithm, which is used to select appropriate data and data collection points in the data source. Then, we generate another decision tree to evaluate the quality of the selected data set and achieve data sampling of high efficiency and high quality. Experiments show that compared with random sampling, the accuracy of the model based on our sampling is improved obviously.

关 键 词:决策树 数据采样 机器学习 

分 类 号:TP311.5[自动化与计算机技术—计算机软件与理论]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象