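Affiliations: [1] Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China; [2] Wuxi Internet of Things Industry Research Institute, Wuxi, Jiangsu 214135, China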
Source: Journal of Zhejiang University (Engineering Science), 2013, No. 6, pp. 944-950 (7 pages)
Funding: National Basic Research Program of China ("973" Program) (2011CB302906); National Science and Technology Major Project of China (2010ZX03006-004)
Abstract: The synthetic minority over-sampling technique (SMOTE) is a widely used method for imbalanced data classification. However, SMOTE synthesizes new samples without any guidance, which makes it sensitive to noise and prone to over-fitting. To address this problem, an improved over-sampling method for imbalanced data sets, called cluster boundary-synthetic minority over-sampling technique (CB-SMOTE), was proposed. A clustering consistency index was introduced to find the boundary samples of the minority class. The nearest-neighbor density of the boundary samples was then used to reject noise samples and to determine the number of synthetic samples, and the SMOTE rule for synthesizing new samples was modified accordingly. CB-SMOTE is a guided over-sampling method, and the samples it generates are more beneficial for classifier learning. Six methods were compared on University of California Irvine (UCI) public data sets. Experimental results show that CB-SMOTE achieves high classification accuracy on both minority-class and majority-class samples and is more stable as the over-sampling rate varies.
Keywords: imbalanced data; over-sampling; cluster boundary; nearest-neighbor density; synthetic samples
Classification number: TP391.4 [Automation and Computer Technology: Computer Application Technology]
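For readers unfamiliar with the baseline that the abstract criticizes, the following is a minimal sketch of plain SMOTE interpolation, not of CB-SMOTE itself. The record does not define the clustering consistency index or the nearest-neighbor density, so neither is reproduced here. The function names `k_nearest` and `smote`, the parameter defaults, and the toy data are illustrative assumptions. The sketch shows the "unguided" behavior: every minority sample is equally likely to be chosen as a seed, regardless of whether it is noise or lies near the class boundary.

```python
import math
import random


def k_nearest(points, index, k):
    """Return the indices of the k points closest to points[index] (Euclidean)."""
    base = points[index]
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(base, points[i]))
    return others[:k]


def smote(minority, n_new, k=5, seed=0):
    """Plain SMOTE sketch (hypothetical helper, not the paper's CB-SMOTE).

    Each synthetic sample lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))           # seed sample, chosen blindly
        j = rng.choice(k_nearest(minority, i, k))  # one of its k nearest neighbors
        gap = rng.random()                         # interpolation factor in [0, 1)
        x, nb = minority[i], minority[j]
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy minority class: the four corners of the unit square (assumed data).
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(samples, n_new=10, k=2)
```

Because the seed sample is drawn uniformly, a noisy minority point inside the majority region produces synthetic samples just as readily as any other point. This is the blindness that the abstract says the boundary detection and density-based noise rejection are meant to address.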