Two-tier Clustering for Mining Imbalanced Datasets  (Cited by: 6)

Authors: Hu Xiaosheng (胡小生) [1], Zhang Runjing (张润晶) [2], Zhong Yong (钟勇) [1]

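Affiliations: [1] School of Electronics and Information Engineering, Foshan University, Foshan 528000; [2] Information and Education Technology Center, Foshan University, Foshan 528000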

Source: Computer Science (《计算机科学》), 2013, No. 11, pp. 271-275 (5 pages)

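Funding: Supported by the Foshan Science and Technology Development Special Fund (2011AA100061), the Foshan Industry-University-Research Special Fund (2012HC100272), and the Foshan Education Bureau research project on an intelligent education evaluation index system (DX20120220).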

Abstract: Classification of class-imbalanced data is an active research topic in machine learning and data mining. Traditional classification algorithms are strongly biased toward the majority class, so they perform poorly on minority-class instances, which are usually of greater interest. This paper proposes a two-tier clustering cascade mining algorithm for class-imbalanced data. The algorithm first builds a balanced training set by cluster-based under-sampling: it applies K-means clustering to the majority-class samples, extracts as many cluster centroids as there are minority-class samples, and merges these centroids with all minority-class instances. To keep the training set from becoming too small when minority samples are scarce, which would lower classification accuracy, SMOTE over-sampling is combined with the cluster-based under-sampling. Next, a cascade of K-means clustering and the C4.5 decision tree algorithm ("K-means+C4.5") is applied to the balanced training set: K-means partitions the training instances into K clusters, and C4.5 builds a decision tree within each cluster, so that the K per-cluster trees refine the decision boundaries by learning the subgroups within each cluster. Experimental results show that the proposed method achieves better classification performance than other approaches on both minority and majority classes, and that it is effective and feasible for imbalanced datasets.

Keywords: data mining; classification; imbalanced data; K-means clustering

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
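
The two stages described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Every function and parameter name here (`balance`, `min_minority`, `fit_cascade`, `predict`, and so on) is hypothetical. The per-cluster tree is a small entropy-based tree standing in for C4.5 (no gain ratio, no pruning). Routing a test instance to the tree of its nearest cluster is an assumption, since the abstract does not say how predictions are made.

```python
import math
import random
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's K-means; returns the k centroids."""
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An emptied cluster keeps its previous centroid.
        centroids = [[sum(c) / len(g) for c in zip(*g)] if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids

def smote(minority, n_new, rng, k=5):
    """SMOTE-style synthesis: interpolate toward one of the k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: dist2(x, m))[:k]
        nb, gap = rng.choice(neighbours), rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

def balance(majority, minority, rng, min_minority=0):
    """Stage 1: optional SMOTE top-up (min_minority is an assumed knob), then
    replace the majority class by as many centroids as there are minority samples."""
    minority = minority + smote(minority, max(0, min_minority - len(minority)), rng)
    centroids = kmeans(majority, len(minority), rng)
    return centroids + minority, [0] * len(centroids) + [1] * len(minority)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def build_tree(X, y, depth=0, max_depth=4):
    """Small entropy-based tree on numeric features (simplified stand-in for C4.5)."""
    majority_label = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1 or depth == max_depth:
        return majority_label
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X})[:-1]:  # thresholds at observed values
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # all rows identical: no split possible
        return majority_label
    _, f, t = best
    lo_part = [(r, lab) for r, lab in zip(X, y) if r[f] <= t]
    hi_part = [(r, lab) for r, lab in zip(X, y) if r[f] > t]
    return (f, t, build_tree(*zip(*lo_part), depth + 1, max_depth),
            build_tree(*zip(*hi_part), depth + 1, max_depth))

def predict_tree(tree, x):
    """Walk internal nodes (tuples) until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, lo, hi = tree
        tree = lo if x[f] <= t else hi
    return tree

def fit_cascade(X, y, k, rng):
    """Stage 2: partition the balanced set with K-means, then fit one tree per cluster."""
    centroids = kmeans(X, k, rng)
    fallback = Counter(y).most_common(1)[0][0]  # used only if a cluster ends up empty
    trees = []
    for j in range(k):
        rows = [(x, lab) for x, lab in zip(X, y) if nearest(x, centroids) == j]
        trees.append(build_tree(*zip(*rows)) if rows else fallback)
    return centroids, trees

def predict(model, x):
    """Assumed inference rule: use the tree of the nearest cluster."""
    centroids, trees = model
    return predict_tree(trees[nearest(x, centroids)], x)
```

Under these assumptions, a call such as `X, y = balance(majority, minority, rng, min_minority=20)` followed by `model = fit_cascade(X, y, 3, rng)` yields a model that `predict(model, x)` can query, where label 0 denotes the majority class and 1 the minority class. The sketch requires at least two minority samples and at least as many majority samples as the topped-up minority count.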

 
