Authors: Mao Yimin (毛伊敏)[1,2]; Li Wenhao (李文豪)
Affiliations: [1] School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China; [2] School of Information Engineering, Shaoguan University, Shaoguan, Guangdong 512000, China
Source: Application Research of Computers (《计算机应用研究》), 2024, No. 2, pp. 473-481 (9 pages)
Funding: Key-Area Research and Development Program of Guangdong Province (2022B0101020002); Guangdong Provincial Key Improvement Project (2022ZDJS048).
Abstract: In big data environments, the parallel deep forest algorithm suffers from too many irrelevant and redundant features, imbalanced multi-granularity scanning, inadequate classification performance, and low parallelization efficiency. To address these problems, this paper proposes a parallel deep forest algorithm based on mutual information and mixed weighting (PDF-MIMW). First, in the feature dimensionality reduction stage, a feature extraction strategy based on mutual information (FE-MI) filters the original features by combining measures of feature importance, interaction, and redundancy, removing excessive irrelevant and redundant features. Next, in the multi-granularity scanning stage, an improved multi-granularity scanning strategy based on padding (IMGS-P) pads the reduced features and randomly samples the subsequences produced by window scanning, keeping the scanning balanced. Then, in the cascade forest construction stage, a sub-forest construction strategy based on mixed weighting (SFC-MW) uses the Spark framework to build weighted sub-forests in parallel, improving the model's classification performance. Finally, in the class vector merging stage, a load balancing strategy based on a hybrid particle swarm optimization algorithm (LB-HPSO) optimizes the load distribution among task nodes in the Spark framework, reducing the waiting time during class vector merging and improving the model's parallelization efficiency. Experiments show that PDF-MIMW achieves better classification performance and higher training efficiency in big data environments.
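As a rough illustration of the relevance part of mutual-information-based feature filtering, the sketch below estimates the empirical mutual information between each discrete feature and the class label and keeps the highest-scoring features. It is not the paper's FE-MI strategy, which also combines interaction and redundancy measures that the abstract does not specify. The function names `mutual_information` and `rank_features` and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in nats, of two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(columns, labels, keep):
    """Return the names of the `keep` features with the highest MI against labels.

    `columns` maps a (hypothetical) feature name to its list of discrete values.
    """
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]


# Toy data: "copy" duplicates the label, "noise" is independent of it.
labels = [0, 0, 1, 1]
columns = {"copy": [0, 0, 1, 1], "noise": [0, 1, 0, 1]}
selected = rank_features(columns, labels, keep=1)
```

On this toy data the feature that duplicates the label scores log 2 nats and the independent one scores 0, so only the former is kept. Continuous features would need discretization or a different estimator before this count-based formula applies.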
Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]