An Under-sampling Method Based on Clustering for Imbalanced Data Set  Cited by: 11

Authors: LI Chun-xue; XIE Lin-sen [2]; LU Cheng-bo [2]

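Affiliations: [1] School of Mathematics and Information Engineering, Zhejiang Normal University, Jinhua 321004, Zhejiang, China; [2] Faculty of Engineering, Lishui University, Lishui 323000, Zhejiang, China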

Source: Mathematics in Practice and Theory (《数学的实践与认识》), 2019, No. 1, pp. 203-209 (7 pages)

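Funding: National Natural Science Foundation of China (11771194); Natural Science Foundation of Zhejiang Province (LY18F030003); Lishui High-Level Talent Funding Project (2017RC01)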

Abstract: To address the classification of imbalanced data sets, an under-sampling method based on clustering is proposed. The majority-class samples in the training set are clustered several times, each time with a different number of clusters. The cluster centers from each run then stand in for the majority class and are combined with the minority-class samples to form several new training sets. A classifier is trained on each of these training sets, classifiers with a tendency toward misclassification are discarded, and the remaining classifiers vote on the final classification. Simulation experiments on 16 data sets with varying imbalance ratios compare the proposed method with several other under-sampling methods. Theoretical analysis and the experimental results show that the proposed method effectively mitigates the class imbalance and improves classification performance on imbalanced data sets.
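The pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes plain Lloyd-style K-means, a 1-nearest-neighbor base classifier, and a filter that drops any classifier whose balanced accuracy on the original training set falls below a threshold. The abstract does not say how "misclassification tendency" is measured, so that rule, the cluster counts, the toy data, and every name here (`kmeans`, `train_ensemble`, `vote`, `min_score`) are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style k-means; returns the k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Recompute each center as its cluster mean; keep the old
        # center if a cluster ends up empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers

def nearest_neighbor(train):
    """Return a 1-NN classifier over (point, label) pairs."""
    def predict(x):
        return min(train, key=lambda pl: sq_dist(x, pl[0]))[1]
    return predict

def balanced_accuracy(predict, majority, minority):
    """Mean of the per-class recalls (label 0 = majority, 1 = minority)."""
    recall_maj = sum(predict(p) == 0 for p in majority) / len(majority)
    recall_min = sum(predict(p) == 1 for p in minority) / len(minority)
    return (recall_maj + recall_min) / 2

def train_ensemble(majority, minority, cluster_counts, min_score=0.7):
    """One classifier per cluster count, trained on centers + minority."""
    ensemble = []
    for k in cluster_counts:
        centers = kmeans(majority, k)
        train = [(c, 0) for c in centers] + [(m, 1) for m in minority]
        clf = nearest_neighbor(train)
        # ASSUMED filter rule: score on the original training data.
        if balanced_accuracy(clf, majority, minority) >= min_score:
            ensemble.append(clf)
    return ensemble

def vote(ensemble, x):
    """Strict-majority vote; ties and an empty ensemble yield class 0."""
    votes = sum(clf(x) for clf in ensemble)
    return 1 if 2 * votes > len(ensemble) else 0

# Toy data (hypothetical): 36 majority points on a grid, 4 minority points.
majority = [(0.5 * i, 0.5 * j) for i in range(6) for j in range(6)]
minority = [(8.0, 8.0), (8.5, 8.0), (8.0, 8.5), (8.5, 8.5)]
ensemble = train_ensemble(majority, minority, cluster_counts=(3, 5, 7))
print(len(ensemble), vote(ensemble, (8.2, 8.1)), vote(ensemble, (1.0, 1.0)))
```

Because each reduced training set holds only k cluster centers alongside the full minority class, the choice of cluster counts controls how balanced each base classifier's training data is; in this sketch the counts are picked to be close to the minority-class size.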

Keywords: imbalanced data set; K-means; under-sampling; K-nearest neighbor; support vector machine

Classification number: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
