DPC-SMOTE Over-sampling Algorithm for Imbalanced Data Classification


Authors: LIU Zhihan (刘志函); ZHANG Zhonglin (张忠林)[1]; ZHAO Lei (赵磊)

Affiliation: [1] College of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China

Source: Journal of Harbin University of Science and Technology (《哈尔滨理工大学学报》), 2024, No. 6, pp. 45-60 (16 pages)

Funding: National Natural Science Foundation of China (61662043); Natural Science Foundation of Gansu Province (21JR7RA288).

Abstract: To address noise and both within-class and between-class imbalance in imbalanced data sets, an oversampling algorithm based on density peak clustering (DPC-SMOTE) is proposed. First, the majority-class samples are preprocessed, and noise samples are identified and deleted. Second, density peak clustering is applied to all minority-class samples, and noise points are removed. Third, sampling weights are assigned according to the sparsity of each resulting cluster, and the number of new samples to synthesize in each cluster is calculated. Finally, SMOTE oversampling is performed within each cluster to synthesize the new samples. The proposed algorithm is compared with five commonly used oversampling algorithms, each combined with five base classifiers, in comparison experiments on 10 imbalanced data sets (the English abstract of the source gives six). The results show that the method improves F1, G-mean and AUC by at least 1.21%, 0.94% and 5.14%, and by at most 15.90%, 14.99% and 11.26%, respectively. This indicates that the method reduces sample overlap, effectively avoids generating noise in imbalanced data sets, and improves classification accuracy.
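The last two steps in the abstract (sparsity-weighted allocation, then SMOTE inside each cluster) can be sketched roughly as below. This is a minimal illustration and not the authors' implementation. The abstract does not define the sparsity measure or the weighting formula, so the mean pairwise distance and the proportional weights used here are assumptions, as are the names `sparsity`, `allocate` and `smote_in_cluster`. The clusters are taken as given; density peak clustering and noise removal are not shown.

```python
import math
import random


def sparsity(cluster):
    """Assumed sparsity measure: mean pairwise Euclidean distance in a cluster."""
    n = len(cluster)
    if n < 2:
        return 0.0
    total = sum(math.dist(cluster[i], cluster[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def allocate(clusters, n_new):
    """Split n_new synthetic samples across clusters in proportion to sparsity
    (assumption: sparser clusters receive more synthetic samples)."""
    s = [sparsity(c) for c in clusters]
    total = sum(s)
    if total == 0:
        weights = [1 / len(clusters)] * len(clusters)
    else:
        weights = [x / total for x in s]
    counts = [int(w * n_new) for w in weights]
    # Hand out the rounding remainder, largest weights first.
    order = sorted(range(len(clusters)), key=lambda i: weights[i], reverse=True)
    for j in range(n_new - sum(counts)):
        counts[order[j % len(order)]] += 1
    return counts


def smote_in_cluster(cluster, count, k=5, rng=random):
    """Synthesize `count` points by interpolating between a sample and one of
    its k nearest neighbours taken from the same cluster (the SMOTE step)."""
    new_points = []
    for _ in range(count):
        base = rng.choice(cluster)
        others = sorted((p for p in cluster if p is not base),
                        key=lambda p: math.dist(base, p))[:k]
        if not others:  # singleton cluster: nothing to interpolate with
            new_points.append(tuple(base))
            continue
        neighbour = rng.choice(others)
        gap = rng.random()  # position along the segment base -> neighbour
        new_points.append(tuple(b + gap * (n - b)
                                for b, n in zip(base, neighbour)))
    return new_points


# Toy minority-class clusters (made-up data for illustration only).
clusters = [
    [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],  # dense cluster
    [(5.0, 5.0), (7.0, 5.0), (5.0, 7.0)],  # sparse cluster
]
counts = allocate(clusters, n_new=10)
rng = random.Random(0)
synthetic = [p for c, m in zip(clusters, counts)
             for p in smote_in_cluster(c, m, rng=rng)]
```

Because every synthetic point lies on a segment between two samples of the same cluster, new samples stay inside that cluster's region. This is consistent with the abstract's claim of reduced overlap, though the actual weighting in the paper may differ.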

Keywords: imbalanced data; classification; oversampling; density peak clustering; sparsity

Classification code: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]

 
