Improved GBDT Algorithm for Imbalanced Data Classification


Authors: LI Changhong (李长洪); ZHENG Kai (郑凯); LIN Boyu (林博宇) (School of Computing, South China Normal University, Guangzhou 510631; Network Center, South China Normal University, Guangzhou 510631)

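Affiliations: [1] School of Computing, South China Normal University, Guangzhou 510631 [2] Network Center, South China Normal University, Guangzhou 510631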

Source: Computer & Digital Engineering (《计算机与数字工程》), 2024, No. 7, pp. 1932-1937, 1943 (7 pages in total)

Funding: Supported by the China University Industry-University-Research Innovation Fund (No. 2020ITA05033).

Abstract: When trained on imbalanced data, many traditional classification algorithms produce classifiers that predict majority-class samples with high accuracy but minority-class samples with low accuracy. To address this problem, an improved Gradient Boosting Decision Tree (GBDT) algorithm is proposed for binary classification of imbalanced data. At the data level, Adaptive Synthetic Sampling (ADASYN) is used to increase the number of minority-class samples. At the algorithm level, the Focal Loss function is introduced into the GBDT binary classification algorithm to increase the model's attention to minority-class samples. In addition, each random subsampling step in GBDT's internal iterations is balanced, which makes the performance of the base classifiers more stable. Comparative experiments on 10 KEEL imbalanced data sets verify the feasibility of these improvements. The proposed algorithm is also compared with three popular imbalanced-data classification algorithms: SMOTEBoost, RUSBoost, and CUSBoost. The results show that the proposed algorithm achieves the highest F1-measure on 7 of the data sets and the highest G-mean on 6 of them, indicating that it is effective for binary classification of imbalanced data.
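The abstract names three algorithm-level ingredients without giving formulas: a focal loss inside GBDT, class-balanced random subsampling, and evaluation by F1-measure and G-mean. The following is a minimal Python sketch of those ideas, not the authors' implementation. It assumes the commonly used focal loss form -α_t (1 - p_t)^γ log(p_t) with a sigmoid link; the defaults `alpha=0.25` and `gamma=2.0`, the equal-count subsampling rule, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z):
    """Map a raw boosting score z to a probability for class 1."""
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(y, z, alpha=0.25, gamma=2.0):
    """Focal loss for one sample with label y in {0, 1} and raw score z.

    alpha and gamma are assumed defaults, not values taken from the paper.
    """
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    a_t = alpha if y == 1 else 1.0 - alpha   # per-class weight
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)


def focal_grad(y, z, alpha=0.25, gamma=2.0):
    """Derivative of focal_loss with respect to z.

    A gradient-boosting round would fit its tree to the negative of this.
    """
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    sign = 1.0 if y == 1 else -1.0
    inner = gamma * p_t * math.log(p_t) + p_t - 1.0
    return sign * a_t * (1.0 - p_t) ** gamma * inner


def balanced_subsample(labels, n_per_class, rng):
    """Draw the same number of indices from each class, without replacement.

    Assumed scheme: the count is capped by the smaller class.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    k = min(n_per_class, len(pos), len(neg))
    return rng.sample(pos, k) + rng.sample(neg, k)


def f1_and_gmean(y_true, y_pred):
    """Return (F1-measure, G-mean), treating class 1 as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return f1, math.sqrt(recall * specificity)
```

With `gamma=0` the gradient reduces to a class-weighted version of the usual log-loss gradient (p - y), which is one way to see that the focal term only down-weights samples the model already classifies confidently. A call such as `balanced_subsample(labels, 50, random.Random(0))` would return index lists with equal class counts for one boosting round.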

Keywords: imbalanced data; gradient boosting decision tree; adaptive synthetic sampling; focal loss function; random subsampling

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
