Parallelization of Random Forest Algorithm Based on Discretization and Selection of Weak-correlation Feature Subspaces  Cited by: 4

Original title: 基于弱相关化特征子空间选择的离散化随机森林并行分类算法


Authors: Chen Mincheng (陈旻骋), Yuan Jingling (袁景凌)[1], Wang Xiaoyan (王啸岩), Zhu Sai (朱赛)[1]

Affiliation: [1] School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070

Source: Computer Science (《计算机科学》), 2016, No. 6, pp. 55-58, 90 (5 pages in total)

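Funding: Supported by the National Natural Science Foundation of China (61303029), the Natural Science Foundation of Hubei Province (2014CFB836), and the Ministry of Education Scientific Research Foundation for Returned Overseas Scholars ([2012]1707)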

Abstract: With the arrival of the big data era, the volume of data is growing exponentially, and traditional classification algorithms face great challenges. To improve the efficiency of classification, this paper proposes a parallel discretized random forest classification algorithm based on the selection of weak-correlation feature subspaces. In the data preprocessing phase, the algorithm discretizes the continuous attributes in the data set. In the phase where the random forest draws feature subspaces, it uses a vector space model of the attributes to compute the correlation between attributes and constructs weak-correlation feature subspaces, which reduces the correlation among the constructed decision trees and thereby improves the classification performance of the random forest. In addition, by studying parallelization strategies for random forests and combining them with the MapReduce framework, the authors improve and implement a double parallelization of the random forest model construction process, which further improves the computational efficiency of the algorithm.
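The two preprocessing and selection steps named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which discretization scheme or correlation measure the paper uses, so the equal-width binning, the mean-centered cosine (Pearson) correlation between attribute vectors, the greedy selection rule, and all function names below are assumptions chosen only for illustration.

```python
import math
import random


def discretize(column, bins=4):
    """Equal-width binning of one continuous attribute (assumed scheme)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0] * len(column)
    width = (hi - lo) / bins
    # Clamp so that the maximum value falls into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in column]


def correlation(u, v):
    """Cosine of the mean-centered attribute vectors (Pearson correlation).

    Assumed stand-in for the paper's attribute vector space model.
    Returns 0.0 when either attribute is constant.
    """
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cu = [a - mu for a in u]
    cv = [b - mv for b in v]
    dot = sum(a * b for a, b in zip(cu, cv))
    nu = math.sqrt(sum(a * a for a in cu))
    nv = math.sqrt(sum(b * b for b in cv))
    return dot / (nu * nv) if nu and nv else 0.0


def weak_correlation_subspace(columns, size, rng):
    """Greedily pick `size` attribute indices with low mutual correlation.

    The first attribute is drawn at random so that different trees
    receive different subspaces (hypothetical selection rule).
    """
    chosen = [rng.randrange(len(columns))]
    while len(chosen) < size:
        rest = [j for j in range(len(columns)) if j not in chosen]
        # Add the attribute whose strongest correlation with the
        # already chosen attributes is the weakest.
        best = min(
            rest,
            key=lambda j: max(abs(correlation(columns[j], columns[c])) for c in chosen),
        )
        chosen.append(best)
    return sorted(chosen)


if __name__ == "__main__":
    attrs = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, -1, 1]]
    print(discretize([0.0, 1.0, 2.0, 3.0, 4.0], bins=4))
    print(weak_correlation_subspace(attrs, 2, random.Random(0)))
```

In this toy data the first two attributes are perfectly correlated, so a two-attribute subspace never contains both. In a parallel setting, each tree's subspace selection and training could be issued as an independent task, which is the kind of per-tree independence a MapReduce job can exploit.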

Keywords: random forest; discretization; weak-correlation feature subspaces; parallel classification

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
