Clustering boundary over-sampling classification method for imbalanced data sets  Cited by: 31

Authors: LOU Xiaojun[1], SUN Yuxuan[1], LIU Haitao[1,2]

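Affiliations: [1] Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China; [2] Wuxi IoT Industry Research Institute, Wuxi 214135, Jiangsu, China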

Source: Journal of Zhejiang University: Engineering Science, 2013, No. 6, pp. 944-950 (7 pages)

Funding: National Basic Research Program of China ("973" Program) (2011CB302906); National Science and Technology Major Project of China (2010ZX03006-004)

Abstract: The synthetic minority over-sampling technique (SMOTE) is widely used for imbalanced data classification. However, traditional SMOTE synthesizes new samples blindly, without any guidance, which makes it sensitive to noise and prone to over-fitting. To resolve these problems, an improved over-sampling method for imbalanced data sets, called cluster boundary-synthetic minority over-sampling technique (CB-SMOTE), is proposed. A clustering consistency index is introduced to find the boundary samples of the minority class. The k-nearest density of each boundary sample is then used to reject noise samples and to determine the number of synthetic samples, and the SMOTE rule for synthesizing new samples is optimized. CB-SMOTE is a guided over-sampling method, and the samples it generates are more beneficial for classifier learning. Six methods were compared on University of California Irvine (UCI) public data sets. The experimental results show that CB-SMOTE achieves high classification accuracy on both minority-class and majority-class samples, and that it is more stable as the over-sampling rate changes.
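
For orientation, the baseline that the abstract criticizes can be sketched as follows. This is a minimal illustration of plain SMOTE interpolation in Python with NumPy, not the authors' CB-SMOTE: this record does not give the definitions of the clustering consistency index or the k-nearest density, so neither is implemented here. The function name `smote` and its parameters (`n_synthetic`, `k`, `seed`) are assumptions made for this example.

```python
import numpy as np

def smote(minority, n_synthetic, k=5, seed=0):
    """Plain SMOTE sketch (baseline only, NOT CB-SMOTE).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    n = len(X)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise squared Euclidean distances; exclude each point from
    # its own neighbour list by setting the diagonal to infinity.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]

    # Choose base samples and neighbours uniformly at random. This is
    # the unguided step: every minority sample is treated alike.
    base = rng.integers(0, n, size=n_synthetic)
    picks = neighbours[base, rng.integers(0, k, size=n_synthetic)]

    # Interpolate with a random gap in [0, 1).
    gap = rng.random((n_synthetic, 1))
    return X[base] + gap * (X[picks] - X[base])
```

Because base samples and neighbours are drawn uniformly, noisy or interior minority points contribute as often as boundary points. According to the abstract, CB-SMOTE replaces this uniform choice with boundary detection and a density-based allocation of synthetic samples.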

Keywords: imbalanced data; over-sampling; clustering boundary; k-nearest density; synthetic samples

Classification code: TP391.4 [Automation and Computer Technology: Computer Application Technology]

 
