Research on a Local Outlier Mining Algorithm in Related Subspaces (相关子空间中的局部离群数据挖掘算法研究)  Cited by: 17

Local Outlier Detection in Related Subspaces


Authors: Li Yonghong (李永红)[1], Zhang Jifu (张继福)[1], Xun Yaling (荀亚玲)[1]

Affiliation: [1] School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2015, No. 3, pp. 460-465 (6 pages)

Funding: Supported by the National Natural Science Foundation of China (61272263)

Abstract: This paper presents an algorithm for mining local outliers in related subspaces of high-dimensional datasets, using local sparse difference and local density difference as its two measurement factors. The algorithm first uses K-nearest neighbors (K-NN) to determine the local dataset of each data object. It then builds a global sparse factor matrix and a local sparse factor matrix from the sparse factors of the attribute values, which together reflect each object's local degree of sparsity. From the local sparse factor matrix, it computes a local sparse difference factor for each attribute dimension and derives the object's subspace definition vector, which characterizes arbitrarily related subspaces. If a data object has a related subspace, its local density difference within that subspace is expressed with the Gaussian error function. This alleviates the "curse of dimensionality" and makes the outlier measure independent of the dimensionality of the related subspace. Otherwise, the object's local density difference is set to 0, marking it as normal data. The N data objects with the largest local density difference (degree of outlierness) are reported as local outliers. Experiments on UCI and stellar spectral datasets validate the effectiveness of the algorithm.
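The pipeline described in the abstract can be sketched roughly as follows. The abstract gives no formulas, so every concrete choice here is an assumption made for illustration and not the paper's definition: the sparse factor is taken to be a population standard deviation, an attribute is treated as relevant when its local sparse factor falls below `ratio_threshold` times the global one, and the local density difference is `erf(z / sqrt(2))` of a dimension-averaged deviation `z`. The function names and the parameters `k`, `ratio_threshold`, and `eps` are likewise hypothetical.

```python
import math


def knn_indices(data, i, k):
    """Indices of the k nearest neighbours of object i (Euclidean, full space)."""
    order = sorted((math.dist(data[i], p), j) for j, p in enumerate(data) if j != i)
    return [j for _, j in order[:k]]


def std(values):
    """Population standard deviation, used here as an assumed 'sparse factor'."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def local_outlier_scores(data, k=5, ratio_threshold=0.3, eps=1e-12):
    """Score each object in [0, 1]; 0 means no related subspace was found."""
    dims = range(len(data[0]))
    # Assumed global sparse factor: spread of each attribute over all objects.
    global_sparse = [std([p[j] for p in data]) for j in dims]
    scores = []
    for i, x in enumerate(data):
        neighbours = [data[j] for j in knn_indices(data, i, k)]
        z_squared, n_relevant = 0.0, 0
        for j in dims:
            column = [p[j] for p in neighbours]
            local_sparse = std(column)
            # Assumed subspace definition vector: attribute j is relevant when
            # the neighbourhood is much tighter in j than the whole dataset.
            if local_sparse < ratio_threshold * global_sparse[j]:
                mean = sum(column) / len(column)
                z_squared += ((x[j] - mean) / (local_sparse + eps)) ** 2
                n_relevant += 1
        if n_relevant == 0:
            scores.append(0.0)  # no related subspace: treated as normal data
        else:
            # Averaging over relevant dimensions keeps the score from growing
            # with subspace size; erf maps the deviation into [0, 1].
            z = math.sqrt(z_squared / n_relevant)
            scores.append(math.erf(z / math.sqrt(2)))
    return scores


def top_n_outliers(data, n, **kwargs):
    """Indices of the n objects with the largest score."""
    scores = local_outlier_scores(data, **kwargs)
    return sorted(range(len(data)), key=lambda i: scores[i], reverse=True)[:n]
```

Calling `top_n_outliers(data, n)` on a list of equal-length numeric rows returns the indices of the `n` highest-scoring objects. Averaging the squared deviations over the relevant attributes is one simple way to keep scores comparable across subspaces of different sizes, which is the property the abstract attributes to the Gaussian error function step.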

Keywords: local outlier data; high-dimensional dataset; local sparse difference; local density difference; related subspace

Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
