Spark下基于PCA和分层选择的随机森林算法被引量：1

Random Forest Algorithm Based on PCA and Hierarchical Selection Under Spark

作　　者：雷晨毛伊敏[1] LEI Chen;MAO Yimin(School of Information Engineering,Jiangxi University of Science&Technology,Ganzhou,Jiangxi 341000,China)

机构地区：[1]江西理工大学信息工程学院,江西赣州341000

出　　处：《计算机工程与应用》2022年第6期118-127,共10页Computer Engineering and Applications

基　　金：国家重点研发计划(2018YFC1504705);国家自然科学基金(41562019);江西省教育厅科技项目(GJJ151528,GJJ151531)。

摘　　要：针对大数据背景下随机森林算法中存在协方差矩阵规模较大、子空间特征信息覆盖不足和节点通信开销大的问题,提出了基于PCA和子空间分层选择的并行随机森林算法PLA-PRF(PCA and subspace layer sampling on parallel random forest algorithm)。对初始特征集,提出了基于PCA的矩阵分解策略(matrix factorization strategy,MFS),压缩原始特征集,提取主成分特征,解决特征变换过程中协方差矩阵规模较大的问题;基于主成分特征,提出基于误差约束的分层子空间构造算法(error-constrained hierarchical subspace construction algorithm,EHSCA),分层选取信息素特征,构建特征子空间,解决子空间特征信息覆盖不足的问题;在Spark环境下并行化训练决策树的过程中,设计了一种数据复用策略(data reuse strategy,DRS),通过垂直划分RDD数据并结合索引表,实现特征复用,解决了节点通信开销大的问题。实验结果表明PLA-PRF算法分类效果更佳,并行化效率更高。In the context of big data, the random forest algorithm has large covariance matrix, insufficient coverage of subspace feature information and high node communication overhead. A parallel random forest algorithm based on PCA and subspace hierarchical selection, PLA-PRF(PCA and subspace layer sampling on parallel random forest algorithm).For the initial feature set, a PCA-based matrix factorization strategy(MFS)is proposed to extract principal component features to solve the problem of large covariance matrix in the process of feature transformation. Based on the obtained principal component features, a hierarchical subspace construction algorithm(error-constrained hierarchical subspace construction algorithm, EHSCA)based on error constraints is proposed, which selects pheromone features hierarchically, constructs feature subspaces, and solves the problem of insufficient coverage of subspace feature information. In the process of parallel training decision trees in the Spark environment, a data reuse strategy(DRS)is designed to solve the problem of high node communication overhead. By vertically dividing RDD data objects, it improves the performance of the distributed environment. Data utilization rate solves the problem of high node communication overhead. Experimental results show that PLA-PRF has better classification effect and higher parallelization efficiency.

关键词：随机森林 SPARK 主成分分析(PCA) 分层抽样误差约束数据划分数据复用

分类号：TP31[自动化与计算机技术—计算机软件与理论]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

Spark下基于PCA和分层选择的随机森林算法被引量：1

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

Spark下基于PCA和分层选择的随机森林算法 被引量：1

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索

Spark下基于PCA和分层选择的随机森林算法被引量：1