A Random Forest Algorithm for Imbalanced Classification    Cited by: 3


Authors: SHEN Zhi-yong, SU Chong, ZHOU Yang, SHEN Zhi-wei (School of Electrical and Information Engineering, Zhangjiagang Branch, Jiangsu University of Science and Technology, Zhangjiagang 215600, China; School of Urban Rail Transportation, Soochow University, Suzhou 215000, China)

Affiliations: [1] School of Electrical and Information Engineering, Jiangsu University of Science and Technology (Zhangjiagang Campus), Zhangjiagang 215600, Jiangsu, China; [2] School of Urban Rail Transportation, Soochow University, Suzhou 215000, Jiangsu, China

Source: Computer and Modernization (《计算机与现代化》), 2018, No. 12, pp. 56-60, 66 (6 pages)

Funding: China Postdoctoral Science Foundation (2016M600430); Open Fund of the Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering (2016KLA08)

Abstract: Random Forest is a simple and effective ensemble learning algorithm. By combining bootstrap sampling with randomized feature subsets, it increases the diversity of the ensemble and builds classifiers that are more accurate than Bagging and Boosting. However, the splitting criterion it uses to grow each tree, the Gini index, has been shown to be skew-sensitive: when learning from imbalanced datasets, the skewed class distribution impairs its ability to learn the minority-class concept, which degrades the classification accuracy of Random Forest. This paper proposes a Random Forest that uses Kullback-Leibler (K-L) divergence as the splitting criterion. Using the area under the ROC curve (AUC) as the evaluation metric, the K-L divergence based Random Forest is compared with Random Forest, Balanced Random Forest, and a Bagging ensemble of Hellinger distance decision trees on both lowly and highly imbalanced datasets. It not only outperforms the other classifiers on more than 70% of the datasets, but also achieves the best average AUC: 0.938 on the lowly imbalanced datasets and 0.937 on the highly imbalanced ones. These results show that using K-L divergence as the splitting criterion effectively improves the performance of Random Forest on imbalanced classification.
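The abstract does not spell out how the K-L criterion is computed at a tree node, so the following Python sketch is only illustrative. It assumes, by analogy with Hellinger distance decision trees, that a candidate binary split is scored by a symmetrised K-L divergence between the two classes' distributions over the branches, P(branch | positive) and P(branch | negative); the names kl_split_score and best_threshold and the smoothing constant eps are hypothetical, not taken from the paper.

    import numpy as np

    def kl_split_score(y_left, y_right, eps=1e-9):
        """Score a candidate binary split by a symmetrised K-L divergence
        between the per-class branch distributions P(branch | class).
        Conditioning on the class, rather than on the branch, makes the
        score independent of the class prior, which is what makes such
        criteria skew-insensitive (assumed analogy with Hellinger trees)."""
        y_left, y_right = np.asarray(y_left), np.asarray(y_right)
        n_pos = (y_left == 1).sum() + (y_right == 1).sum()
        n_neg = (y_left == 0).sum() + (y_right == 0).sum()
        if n_pos == 0 or n_neg == 0:
            return 0.0  # a pure node offers nothing to separate
        # Distribution of each class over the two branches {left, right}
        p = np.array([(y_left == 1).sum(), (y_right == 1).sum()]) / n_pos
        q = np.array([(y_left == 0).sum(), (y_right == 0).sum()]) / n_neg
        p = np.clip(p, eps, 1.0)  # smooth zeros so log() stays finite
        q = np.clip(q, eps, 1.0)
        return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

    def best_threshold(x, y):
        """Scan midpoints between sorted unique values of one feature and
        keep the threshold with the highest K-L split score."""
        x, y = np.asarray(x), np.asarray(y)
        values = np.unique(x)
        best_t, best_s = None, -np.inf
        for t in (values[:-1] + values[1:]) / 2.0:
            mask = x <= t
            s = kl_split_score(y[mask], y[~mask])
            if s > best_s:
                best_t, best_s = t, s
        return best_t, best_s

In an otherwise standard Random Forest (one bootstrap sample per tree, a random feature subset at each node), such a score would simply replace the Gini impurity reduction when choosing each split, which is the substitution the abstract describes.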

Keywords: imbalanced classification; K-L divergence; random forest; balanced random forest; Bagging

CLC number: TP301.6 (Automation and Computer Technology: Computer System Architecture)

 
