Authors: LIU Zhen-peng (刘振鹏); SU Nan (苏楠); QIN Yi-wen (秦益文); LU Jia-huan (卢家欢); LI Xiao-fei (李小菲)
Affiliations: [1] School of Cyber Security and Computer, Hebei University, Baoding, Hebei 071002, China; [2] Information Technology Center, Hebei University, Baoding, Hebei 071002, China; [3] School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
Source: Computer Science (《计算机科学》), 2020, No. 8, pp. 185-188 (4 pages)
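Funding: Natural Science Foundation of Hebei Province (F2019201427); Ministry of Education "Cloud-Data Integration Science and Education Innovation" Fund (2017A20004).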
Abstract: In the era of big data, attacks and tampering, equipment failure, and human fraud leave many abnormal values hidden in massive data sets. Accurately detecting these outliers is critical to data cleaning. This paper proposes an outlier detection model based on Feature Segmentation and Cascaded Random Forest (FS-CRF). A sliding window and random forests segment the original features at a fine granularity and generate class probability vectors, which are then used to train a multi-level cascade of random forests; the final class of a sample is decided by a vote of the random forests in the last cascade level. Simulation results show that the new method achieves high F1-measure scores in outlier classification tasks on multiple UCI data sets, both high- and low-dimensional. The cascade structure further improves generalization ability compared with the classical random forest. On high-dimensional data sets the proposed method outperforms gradient boosting decision trees (GBDT) and XGBoost, and it has fewer hyper-parameters, is easy to tune, and offers better overall performance.
Keywords: data cleaning; fine-grained features; cascaded random forest; ensemble learning; outlier detection
Classification code: TP301 [Automation and Computer Technology: Computer System Architecture]
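The abstract mentions two pieces that are easy to make concrete: slicing a feature vector with a sliding window, and scoring predictions with the F1-measure. The sketch below illustrates only those two pieces. It is not the authors' implementation: the function names `sliding_windows` and `f1_measure`, the window width, and the stride are all assumptions chosen for illustration, and the random forests that would consume each window are left out.

```python
from typing import List, Sequence


def sliding_windows(features: Sequence[float], width: int, stride: int = 1) -> List[List[float]]:
    """Cut one sample's feature vector into fixed-width segments.

    `width` and `stride` are illustrative parameters; the abstract does not
    state the values used in the paper.
    """
    if width <= 0 or stride <= 0:
        raise ValueError("width and stride must be positive")
    if width > len(features):
        raise ValueError("width cannot exceed the number of features")
    # Each start index i yields the segment features[i : i + width].
    return [list(features[i:i + width])
            for i in range(0, len(features) - width + 1, stride)]


def f1_measure(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 = 2PR / (P + R) for the class labelled `positive` (assumed: outlier)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sample = [0.1, 0.2, 0.3, 0.4, 0.5]
    # Three overlapping segments of width 3.
    print(sliding_windows(sample, width=3))
    # Two true positives, one false positive, one false negative.
    print(f1_measure([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

In a model of the kind the abstract describes, each segment would presumably be passed to a random forest whose class probability output becomes an input feature for the cascade. That wiring is not specified in this record and is not shown here.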