A Novel Hybrid Re-sampling Algorithm Based on Imbalanced Data Sets  Cited by: 1

Novel Hybrid Re-sampling Algorithm Based Imbalanced Data Sets

Authors: Gu Qiong [1], Wang Xianming [2], Li Wenxin [1]

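Affiliations: [1] School of Mathematics and Computer Science, Xiangfan University, Xiangfan 441053, Hubei, China; [2] Oujiang College, Wenzhou University, Wenzhou 325035, Zhejiang, China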

Source: Journal of Wuhan University of Technology, 2010, No. 20, pp. 55-60 (6 pages)

Funding: National High Technology Research and Development Program of China (863 Program) (2009AA12Z117); Xiangfan University Planned Project (2009YA012)

Abstract: Building on an analysis of re-sampling techniques, this paper designs and implements a hybrid re-sampling algorithm based on Automated Adaptive Selection of the Number of Nearest Neighbors (ADSNNHRS). The method combines the strengths of over-sampling and under-sampling by pairing an improved Synthetic Minority Over-sampling Technique (SMOTE) with the neighborhood cleaning rule (NCL) data cleaning method. Standard SMOTE has two weaknesses: it generates synthetic minority-class examples blindly, by randomly interpolating between pairs of nearest neighbors, and it can only generate numeric attributes, so data sets with nominal features cannot be handled. In the over-sampling stage, the new algorithm addresses both problems by adapting its neighbor-selection strategy to the actual internal distribution of the sample set and by using a different generation method for each type of attribute, which controls and improves the quality of the synthetic examples. In the under-sampling stage, an improved neighborhood cleaning rule is applied to the data set after synthesis to remove redundant majority-class examples and noisy examples on the class boundary, reducing the size of the majority class so that the data set becomes relatively balanced. The motivation is not only to balance the training data but also to remove noisy examples lying on the wrong side of the decision border. Removing noisy examples may help find better-defined class clusters and so allow simpler models with better generalization, which makes the processing of imbalanced data sets (IDS) more effective and improves classifier performance.
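The two-stage idea in the abstract (synthesize minority examples, then clean the majority class) can be illustrated with a minimal Python sketch. This is not the ADSNNHRS algorithm itself: the abstract does not give the paper's adaptive neighbor rule or its NCL improvements, so everything below is an assumption made for illustration. In particular, the function names (`distance`, `neighbors`, `oversample`, `clean`, `hybrid_resample`), the mixed numeric/nominal distance, the rule "interpolate only toward same-class neighbors and skip isolated minority points", and the use of the plain neighborhood cleaning rule are hypothetical stand-ins.

```python
import math
import random
from collections import Counter

def distance(a, b, nominal):
    """Mixed distance (assumed): squared gap on numeric features, 0/1 mismatch on nominal ones."""
    total = 0.0
    for f, (x, y) in enumerate(zip(a, b)):
        total += (x != y) if f in nominal else (x - y) ** 2
    return math.sqrt(total)

def neighbors(idx, X, k, nominal):
    """Indices of the k nearest neighbors of X[idx], excluding idx itself."""
    others = [j for j in range(len(X)) if j != idx]
    others.sort(key=lambda j: distance(X[idx], X[j], nominal))
    return others[:k]

def oversample(X, y, minority, n_new, k=5, nominal=frozenset(), rng=None):
    """SMOTE-style synthesis restricted to same-class neighbors (hypothetical adaptive rule)."""
    rng = rng or random.Random(0)
    seeds = []
    for i in range(len(X)):
        if y[i] != minority:
            continue
        same = [j for j in neighbors(i, X, k, nominal) if y[j] == minority]
        if same:  # a minority point with no minority neighbor is treated as noise and skipped
            seeds.append((i, same))
    synthetic = []
    while seeds and len(synthetic) < n_new:
        i, same = rng.choice(seeds)
        j = rng.choice(same)
        gap = rng.random()
        new = []
        for f in range(len(X[i])):
            if f in nominal:
                # Nominal feature: most frequent value among the seed and its minority neighbors.
                votes = Counter(X[m][f] for m in same + [i])
                new.append(votes.most_common(1)[0][0])
            else:
                # Numeric feature: linear interpolation between the seed and one neighbor.
                new.append(X[i][f] + gap * (X[j][f] - X[i][f]))
        synthetic.append(tuple(new))
    return synthetic

def clean(X, y, minority, k=3, nominal=frozenset()):
    """Plain NCL-style cleaning: only majority-class examples are ever removed."""
    drop = set()
    for i in range(len(X)):
        nn = neighbors(i, X, k, nominal)
        predicted = Counter(y[j] for j in nn).most_common(1)[0][0]
        if predicted == y[i]:
            continue
        if y[i] != minority:
            drop.add(i)  # majority example misclassified by its own neighborhood
        else:
            drop.update(j for j in nn if y[j] != minority)  # majority neighbors of a misclassified minority example
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

def hybrid_resample(X, y, minority, n_new, nominal=frozenset(), seed=0):
    """Over-sample the minority class, then clean the combined data set."""
    new = oversample(X, y, minority, n_new, nominal=nominal, rng=random.Random(seed))
    X2 = list(X) + new
    y2 = list(y) + [minority] * len(new)
    return clean(X2, y2, minority, nominal=nominal)
```

In this sketch, `nominal` is the set of column indices that hold nominal values, and each example is a tuple of features. Running the cleaning step after synthesis mirrors the order described in the abstract, so majority examples that sit inside a minority region are judged against the enlarged minority class.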

Keywords: imbalanced data sets; re-sampling; machine learning; classification

Classification code: TP311.1 [Automation and Computer Technology: Computer Software and Theory]

 
