Research and Parallelization of Isolation Forest Algorithm  Cited by: 15

Authors: WANG Cheng [1], DI Xuan

Affiliation: [1] School of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China

Source: Computer Technology and Development, 2021, No. 6, pp. 13-18 (6 pages)

Funding: Natural Science Foundation of Jiangsu Province (BK20141428).

Abstract: Anomaly detection has been one of the active research topics in data mining in recent years. The Isolation Forest algorithm is an efficient unsupervised anomaly detection algorithm that handles high-dimensional, large-scale data well. It has two shortcomings, however. First, it computes the anomaly score of a test sample from the sample's average path length across the forest, which ignores differences in anomaly-detection ability among the isolation trees. Second, building a large number of isolation trees on large-scale data consumes considerable memory and time. To address both, a parallelized, improved Isolation Forest algorithm is proposed: each isolation tree is weighted by the standard deviation of its path lengths when the anomaly score is computed, and the algorithm is parallelized on the Spark platform. Comparison experiments on public datasets, together with parallel-performance experiments under multiple parameter configurations, show that the proposed algorithm improves the accuracy of anomaly detection, has good parallel performance, and efficiently processes large-scale datasets that require a large number of isolation trees.
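The weighting idea in the abstract can be illustrated with a small, single-machine sketch. The abstract does not give the weighting formula, so the scheme below is an assumption: each tree's weight is taken to be proportional to the standard deviation of the path lengths it assigns to the scored samples, normalized so the weights sum to 1. The score itself uses the standard Isolation Forest form s(x) = 2^(-E[h(x)] / c(ψ)), with the plain average E[h(x)] swapped for the weighted average. All function names (`c_factor`, `build_tree`, `fit_forest`, `weighted_scores`) are hypothetical, and nothing here reproduces the authors' Spark implementation.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Split recursively on a random feature at a random cut value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split on a constant feature
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return {"dim": dim, "cut": cut,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(x, node):
    """Depth at which x reaches a leaf, plus the usual leaf-size correction."""
    depth = 0
    while "dim" in node:
        node = node["left"] if x[node["dim"]] < node["cut"] else node["right"]
        depth += 1
    return depth + c_factor(node["size"])


def fit_forest(data, n_trees=50, sample_size=64, seed=0):
    """Build n_trees isolation trees, each on a random subsample of size psi."""
    if len(data) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    psi = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(data, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, psi


def weighted_scores(trees, psi, samples):
    """Anomaly scores from a std-weighted average of per-tree path lengths."""
    # lengths[t][i] is the path length of sample i in tree t.
    lengths = [[path_length(x, tree) for x in samples] for tree in trees]
    # ASSUMPTION (not from the paper): weight is proportional to the spread
    # of path lengths a tree produces over the scored samples.
    stds = [statistics.pstdev(row) for row in lengths]
    total = sum(stds)
    if total > 0:
        weights = [s / total for s in stds]
    else:  # degenerate case: fall back to the plain average
        weights = [1.0 / len(trees)] * len(trees)
    scores = []
    for i in range(len(samples)):
        h = sum(w * row[i] for w, row in zip(weights, lengths))
        scores.append(2.0 ** (-h / c_factor(psi)))
    return scores


# Usage: 200 Gaussian inliers plus one far-away point (synthetic data).
demo_rng = random.Random(1)
demo_data = [(demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)) for _ in range(200)]
demo_data.append((8.0, 8.0))
demo_trees, demo_psi = fit_forest(demo_data)
demo_scores = weighted_scores(demo_trees, demo_psi, demo_data)
print("score of the far-away point:", round(demo_scores[-1], 3))
print("median score:", round(statistics.median(demo_scores), 3))
```

Each tree is built independently of the others, which is why tree construction is the natural unit to distribute across workers on a platform such as Spark. This sketch stays on one machine and does not attempt that step.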

Keywords: anomaly detection; Isolation Forest algorithm; isolation tree; Spark; parallelization

Classification code: TP301. [Automation and Computer Technology: Computer System Architecture]
