Hybrid imbalanced data processing based on ADASYN and WGAN (cited by: 1)


Authors: ZHOU Wanzhen [1,2]; SHENG Yuanyuan; ZHANG Yongqiang; MA Jinlong [1,2]

Affiliations: [1] School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, China; [2] Hebei Technology Innovation Center of Intelligent IoT, Shijiazhuang, Hebei 050018, China

Source: Hebei Journal of Industrial Science and Technology, 2024, No. 4, pp. 291-298 (8 pages)

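Funding: Natural Science Foundation of Hebei Province (F2022208002); Key Project of Science and Technology Research in Higher Education Institutions of Hebei Province (ZD2021048).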

Abstract: To address the low classification accuracy of minority class samples in imbalanced datasets, an ADASYN-WGAN method for processing imbalanced datasets is proposed. First, the ADASYN (adaptive synthetic sampling) algorithm is used to generate minority class samples, and these generated samples replace the random noise input of the WGAN (Wasserstein generative adversarial network). Second, the WGAN is used to generate minority class samples that conform to the distribution of the original dataset, yielding a balanced dataset. Then, on six public datasets, a random forest classifier is used to compare the results of the proposed method and of four oversampling algorithms against the original dataset. Finally, the effectiveness of the proposed method is verified with classification metrics including F1-Score, G-mean, and AUC. The results show that, under ten-fold cross-validation with the random forest classifier, the balanced datasets produced by ADASYN-WGAN achieve the best values on all classification metrics for four of the six public datasets; on the other two datasets the AUC is slightly lower, but the F1-Score and G-mean are the highest. The proposed ADASYN-WGAN method can generate high-quality data samples and offers a reference for addressing prediction bias against minority class samples in imbalanced datasets.
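The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. It uses the standard binary-classification definitions and assumes the minority class is treated as the positive class; the abstract does not state the exact formulas the authors used, so the function names, the toy labels, and the 0.5 decision threshold below are illustrative assumptions, not the paper's implementation or data.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def g_mean(y_true, y_pred, positive=1):
    """G-mean = sqrt(sensitivity * specificity)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 minority (label 1) and 7 majority (label 0) samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

f1_value = f1_score(y_true, y_pred)
g_mean_value = g_mean(y_true, y_pred)
auc_value = auc(y_true, scores)
print(f"F1={f1_value:.4f}  G-mean={g_mean_value:.4f}  AUC={auc_value:.4f}")
```

F1-Score and G-mean depend on the hard predictions, while AUC depends only on how the scores rank the samples. This is one way a method can lead on F1-Score and G-mean while trailing slightly on AUC, as the abstract reports for two of the datasets.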

Keywords: data processing; imbalanced data; WGAN; ADASYN; oversampling method; random forest

Classification number: TP399 [Automation and Computer Technology: Computer Application Technology]

 
