Parallel random forest algorithm combining gain ratio and stacked auto-encoders


Authors: Liu Weiming[1,2], Chen Weida, Mao Yimin[1], Chen Zhigang[3]

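Affiliations: [1] School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China; [2] School of Resource & Environmental Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China; [3] School of Computer Science & Engineering, Central South University, Changsha 410083, China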

Source: Application Research of Computers (《计算机应用研究》), 2023, Issue 3, pp. 750-759, 765 (11 pages in total)

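Funding: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project, 2020 (2020AAA0109605); National Natural Science Foundation of China (41562019).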

Abstract: In big data environments, the random forest algorithm suffers from too many redundant and irrelevant features, insufficient information content in its feature subspaces, and low parallelization efficiency. To address these problems, this paper proposes PRFGRSAE (parallel random forest algorithm combining gain ratio and stacked auto encoders). First, a dimension reduction strategy, DRNGRSAE (dimension reduction combining nonlinear normalization gain ratio and stacked auto encoders), filters redundant and irrelevant features out of the feature set and then extracts features with stacked auto-encoders, which effectively reduces the number of redundant and irrelevant features. Second, a subspace selection strategy, SSLF (subspace selection strategy combining Latin hypercube sampling and feature class correlation), combines Latin hypercube sampling with a normalized correlation degree and performs multi-layer division sampling on the feature set, forming feature subspaces with high spatial expressiveness and thereby ensuring their information content. Finally, a reducer allocation strategy, DSVLA (distribution strategy based on variable-action learning automata), distributes the data clusters evenly across reducers for processing, which effectively improves parallelization efficiency. Experimental results show that the speedup ratio and accuracy of PRFGRSAE are significantly better than those of the IMRF, KSMRF, and GAPRF algorithms. The algorithm can therefore obtain higher accuracy and parallel efficiency when applied to big data processing, especially on data sets with many features.
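The abstract does not define the nonlinear normalization used in DRNGRSAE, so the sketch below illustrates only the textbook gain ratio (information gain divided by split information) on discrete features, followed by a simple threshold filter. The function names, the threshold, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(feature, labels):
    """Textbook gain ratio of one discrete feature with respect to the labels."""
    total = len(labels)
    # Group the class labels by the feature value they co-occur with.
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy H(labels | feature), weighted by group size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # A constant feature has zero split information; treat it as uninformative.
    return info_gain / split_info if split_info > 0 else 0.0


def filter_features(columns, labels, threshold):
    """Return the names of features whose gain ratio reaches the threshold.

    `threshold` is a hypothetical cut-off, not a value from the paper.
    """
    return [name for name, col in columns.items()
            if gain_ratio(col, labels) >= threshold]


# Toy data (made up): one perfectly predictive feature, one irrelevant, one constant.
labels = [0, 0, 1, 1]
columns = {
    "perfect": ["a", "a", "b", "b"],
    "noise": ["x", "y", "x", "y"],
    "const": ["c", "c", "c", "c"],
}
kept = filter_features(columns, labels, threshold=0.5)
```

In a pipeline of the kind the abstract outlines, the surviving features would then be passed to a stacked auto-encoder for feature extraction; that step is omitted here because the abstract gives no architectural details.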

Keywords: big data; MapReduce; parallel random forest; gain ratio; stacked auto-encoder

Classification number: TP301.6 [Automation and Computer Technology: Computer System Architecture]

 
