An Extended Isolation Forest Outlier Detection Algorithm Based on Relevant Subspace

Cited by: 1

Authors: LIU Jia [1], ZHU Peng-yun, XUN Ya-ling [1] (School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China)

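Affiliation: [1] School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, Shanxi, China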

Source: Computer Technology and Development, 2022, No. 10, pp. 26-33, 40 (9 pages)

Funding: National Natural Science Foundation of China (61602335); Natural Science Foundation of Shanxi Province (201901D211302).

Abstract: Extended isolation forest is an ensemble outlier detection method that splits data with hyperplanes of random slope. It separates outliers from normal data objects quickly and has low time complexity. However, when an isolation tree's hyperplane is chosen in a dense region of the data set, or in a region containing irrelevant dimensions, detection quality degrades severely. Drawing on the idea and methods of relevant subspaces, this paper proposes an extended isolation forest outlier detection algorithm. The algorithm uses a Gaussian mixture model to determine the relevant subspace of each data object, which ensures that the cutting hyperplanes of the isolation trees are chosen in sparse data regions. When an isolation tree branches, the random intercept point of the hyperplane is selected preferentially in sparse data regions, so outliers are isolated from those regions quickly and irrelevant attribute dimensions do not interfere with the choice of the hyperplane's random slope. The average path length of each data object over all isolation trees is normalized and used as its outlier score, and the data objects with the largest outlier scores are reported as outliers. Experiments on UCI data sets validate the effectiveness of the algorithm and examine how the sub-sample size, the number of isolation trees, and the number of nearest neighbors affect detection quality.
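The baseline that the abstract builds on can be sketched in a few lines. The code below is a minimal, standard-library-only illustration of a generic extended isolation forest (random-slope hyperplanes, random intercepts, normalized average path length), not the authors' algorithm. Several things are assumptions: the function names (`c_factor`, `build_tree`, `path_length`, `eif_scores`, `top_outliers`) are hypothetical, the score uses the conventional isolation forest normalization 2^(-E[h(x)]/c(n)), and the `dims` argument is only a stand-in for a relevant subspace supplied by the caller. The abstract does not give the rule by which the Gaussian mixture model selects that subspace, nor how intercepts are biased toward sparse regions, so the sketch draws intercepts uniformly from the node's bounding box.

```python
import math
import random


def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximates H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def _side(x, normal, intercept):
    """Signed position of x relative to the hyperplane through `intercept`."""
    return sum((xi - pi) * ni for xi, pi, ni in zip(x, intercept, normal))


def build_tree(points, rng, depth, max_depth, dims):
    """Recursively split points with random-slope hyperplanes."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    d = len(points[0])
    # Random slope: Gaussian normal vector, zeroed outside the chosen dimensions
    # (assumption: `dims` plays the role of a relevant subspace given by the caller).
    normal = [rng.gauss(0.0, 1.0) if j in dims else 0.0 for j in range(d)]
    # Random intercept: uniform in the bounding box of this node's points
    # (assumption: plain EIF; no preference for sparse regions here).
    intercept = [rng.uniform(min(p[j] for p in points), max(p[j] for p in points))
                 for j in range(d)]
    left, right = [], []
    for p in points:
        (left if _side(p, normal, intercept) <= 0 else right).append(p)
    return {"normal": normal, "intercept": intercept,
            "left": build_tree(left, rng, depth + 1, max_depth, dims),
            "right": build_tree(right, rng, depth + 1, max_depth, dims)}


def path_length(x, node):
    """Depth at which x reaches a leaf, plus a correction for unsplit leaf points."""
    depth = 0
    while "size" not in node:
        go_left = _side(x, node["normal"], node["intercept"]) <= 0
        node = node["left"] if go_left else node["right"]
        depth += 1
    return depth + c_factor(node["size"])


def eif_scores(data, n_trees=100, sample_size=64, dims=None, seed=0):
    """Outlier score in (0, 1) per point; larger means more outlying."""
    rng = random.Random(seed)
    psi = max(2, min(sample_size, len(data)))  # needs at least two data points
    dims = set(range(len(data[0]))) if dims is None else set(dims)
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(data, psi), rng, 0, max_depth, dims)
             for _ in range(n_trees)]
    norm = c_factor(psi)
    return [2.0 ** (-sum(path_length(x, t) for t in trees) / (n_trees * norm))
            for x in data]


def top_outliers(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Calling `eif_scores(data)` on a list of equal-length numeric lists and passing the result to `top_outliers(scores, k)` mirrors the final step described in the abstract, where the objects with the largest scores are reported as outliers. Passing a subset of column indices as `dims` restricts the hyperplane normals to those columns, which is one simple way to keep irrelevant dimensions out of the random slope.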

Keywords: outlier detection; extended isolation forest; relevant subspace; Gaussian mixture model; sparse data region

Classification No.: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
