Voting Feature Selection Algorithm in Big Data Environment (大数据环境下的投票特征选择算法)  Cited by: 1


Authors: ZHOU Xiang; ZHAI Jun-hai [1,2]; HUANG Ya-jie; SHEN Rui-cai; HOU Ying-zhen

Affiliations: [1] College of Mathematics and Information Science, Hebei University, Baoding 071000, China; [2] Hebei Key Laboratory of Machine Learning and Computational Intelligence, Hebei University, Baoding 071000, China

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2022, No. 5, pp. 936-942 (7 pages)

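Funding: Supported by the National Natural Science Foundation of China (71371063), the Key R&D Program of the Hebei Province Science and Technology Plan (19210310D), the Natural Science Foundation of Hebei Province (F2017201026), and the Hebei University Postgraduate Innovation Project (hbu2020ss045).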

Abstract: With the explosive growth of data, big data problems are attracting increasing attention. Because big data is high-dimensional, complex, and rapidly changing, traditional machine learning algorithms are no longer applicable, which makes feature selection for big data an urgent problem. This paper proposes a voting feature selection algorithm for big data environments, based on a voting mechanism and the decision tree algorithm. The algorithm proceeds as follows: the big data set U is randomly partitioned into L subsets, the L subsets are sent to L map nodes, and each map node performs feature selection with the decision tree algorithm. At the reduce node, the features selected by the map nodes are put to a vote, and the features whose vote count exceeds a threshold are selected. The proposed algorithm was implemented on two open-source big data platforms, Hadoop and Spark, and the experiments revealed a number of similarities and differences in how the two platforms operate. In addition, the proposed algorithm was compared with a univariate feature selection algorithm and a genetic-algorithm-based feature selection algorithm on five high-dimensional data sets. Analysis of the experimental results shows that the proposed algorithm outperforms both related algorithms in classification accuracy and execution efficiency, demonstrating that it is superior to them and can effectively solve the feature selection problem for high-dimensional data.
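The partition, map, and vote steps described in the abstract can be sketched roughly as follows. This is a minimal single-process illustration, not the authors' Hadoop or Spark implementation: the abstract does not say which decision tree algorithm is used, so the small information-gain tree here is an assumption, and all names (`tree_features`, `voting_feature_selection`, `n_subsets`, `vote_threshold`, `max_depth`) are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature, threshold) of the best binary split, or None."""
    base = entropy(labels)
    best = None
    for f in range(len(rows[0])):
        # Candidate thresholds: every distinct value except the largest,
        # so both sides of the split are always non-empty.
        for t in sorted(set(r[f] for r in rows))[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - weighted
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, f, t)
    return best


def tree_features(rows, labels, max_depth=3):
    """Map step (assumed form): grow a small tree, return the features it splits on."""
    if max_depth == 0 or len(set(labels)) < 2:
        return set()
    split = best_split(rows, labels)
    if split is None:
        return set()
    _, f, t = split
    used = {f}
    for go_left in (True, False):
        part = [(r, y) for r, y in zip(rows, labels) if (r[f] <= t) == go_left]
        used |= tree_features([r for r, _ in part], [y for _, y in part], max_depth - 1)
    return used


def voting_feature_selection(rows, labels, n_subsets, vote_threshold, seed=0):
    """Randomly partition the data into n_subsets parts, select features on each
    part, then keep the features whose vote count exceeds vote_threshold."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    votes = Counter()
    for k in range(n_subsets):  # each iteration stands in for one map node
        part = idx[k::n_subsets]
        votes.update(tree_features([rows[i] for i in part], [labels[i] for i in part]))
    # Reduce step: count votes and apply the threshold.
    return sorted(f for f, v in votes.items() if v > vote_threshold)
```

In a real deployment the loop over subsets would run in parallel on the map nodes and only the selected feature indices would be sent to the reduce node; the sequential loop here just keeps the sketch self-contained.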

Keywords: big data; feature selection; decision tree; machine learning; voting mechanism

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
