Authors: LI Changhong; ZHENG Kai; LIN Boyu
Affiliations: [1] School of Computing, South China Normal University, Guangzhou 510631; [2] Network Center, South China Normal University, Guangzhou 510631
Source: Computer & Digital Engineering, 2024, Issue 7, pp. 1932-1937, 1943 (7 pages)
Funding: Supported by the China University Industry-University-Research Innovation Fund (No. 2020ITA05033).
Abstract: When dealing with imbalanced data, many traditional classification algorithms train classifiers whose prediction accuracy is high for majority-class samples but low for minority-class samples. To address this problem, an improved Gradient Boosting Decision Tree (GBDT) algorithm is proposed for binary classification on imbalanced data. At the data level, Adaptive Synthetic Sampling (ADASYN) is used to increase the number of minority-class samples. At the algorithm level, the focal loss function (Focal Loss) is introduced into the GBDT binary classification algorithm to increase the model's attention to minority-class samples, and each random subsample drawn during GBDT's internal iterations is class-balanced so that the base classifiers perform more stably. Comparative experiments on 10 KEEL imbalanced data sets verify the feasibility of the improvements. The improved algorithm is further compared with three popular imbalanced-data classification algorithms: SMOTEBoost, RUSBoost, and CUSBoost. The experimental results show that it achieves the highest F1-measure on 7 of the data sets and the highest G-mean on 6 of them, confirming that the proposed algorithm performs well on binary classification of imbalanced data.
Keywords: imbalanced data; gradient boosting decision tree; adaptive synthetic sampling; focal loss; random subsampling
Classification Code: TP181 [Automation and Computer Technology - Control Theory and Control Engineering]
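
The abstract describes three components: ADASYN oversampling of the minority class, a focal-loss-driven GBDT for binary classification, and a class-balanced random subsample at every boosting iteration. The minimal Python sketch below shows how these pieces could fit together; it is not the paper's code. The class and helper names (FocalGBDT, focal_grad, balanced_subsample), all hyper-parameter values, and the use of finite differences in place of analytic gradients are illustrative assumptions, with imbalanced-learn's ADASYN and scikit-learn regression trees standing in for the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from imblearn.over_sampling import ADASYN   # assumed dependency: imbalanced-learn


def focal_grad(y, raw, gamma=2.0, alpha=0.75, eps=1e-4):
    """Negative gradient of the binary focal loss w.r.t. the raw score,
    estimated by central finite differences (kept numeric for brevity)."""
    def loss(z):
        p = 1.0 / (1.0 + np.exp(-z))
        pt = np.clip(y * p + (1 - y) * (1 - p), 1e-12, 1.0)
        a_t = y * alpha + (1 - y) * (1 - alpha)
        return -a_t * (1 - pt) ** gamma * np.log(pt)
    return -(loss(raw + eps) - loss(raw - eps)) / (2 * eps)


def balanced_subsample(y, frac=0.5, rng=None):
    """Draw the same number of majority and minority indices each round."""
    rng = rng or np.random.default_rng()
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    n = int(frac * min(len(pos), len(neg)))
    return np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])


class FocalGBDT:
    """Plain gradient boosting with regression trees as base learners."""
    def __init__(self, n_estimators=100, lr=0.1, max_depth=3, gamma=2.0):
        self.n_estimators, self.lr, self.max_depth, self.gamma = \
            n_estimators, lr, max_depth, gamma
        self.trees = []

    def fit(self, X, y):
        rng = np.random.default_rng(0)
        raw = np.zeros(len(y))
        for _ in range(self.n_estimators):
            idx = balanced_subsample(y, rng=rng)          # balanced subsampling
            residual = focal_grad(y, raw, self.gamma)     # focal-loss pseudo-residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X[idx], residual[idx])
            raw += self.lr * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict_proba(self, X):
        raw = self.lr * sum(t.predict(X) for t in self.trees)
        return 1.0 / (1.0 + np.exp(-raw))


# Usage sketch: oversample the minority class with ADASYN, then boost.
# X_train, y_train are assumed to be numpy arrays with labels in {0, 1}.
# X_res, y_res = ADASYN(random_state=0).fit_resample(X_train, y_train)
# model = FocalGBDT().fit(X_res, y_res)
# y_prob = model.predict_proba(X_test)
```

The balanced subsample is redrawn before each tree is fitted, so every base learner sees equal numbers of positive and negative examples, while the focal-loss pseudo-residuals keep the ensemble focused on hard, mostly minority-class samples.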