Affiliation: [1] School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2015, No. 3, pp. 460-465 (6 pages)
Funding: Supported by the National Natural Science Foundation of China (61272263)
Abstract: This paper presents an algorithm for mining local outliers in relevant subspaces of high-dimensional datasets, using local sparsity difference and local density difference as its two measurement factors. The algorithm first determines each data object's local dataset from its K nearest neighbors (K-NN). From the sparsity factors of the attribute values it then builds a global sparsity factor matrix and a local sparsity factor matrix, which together reflect how sparse each data object is locally. From the local sparsity factor matrix it computes a local sparsity difference factor for each attribute dimension and derives the data object's subspace definition vector, which characterizes arbitrarily correlated relevant subspaces. If a data object has a relevant subspace, its local density difference within that subspace is expressed with the Gaussian error function. This mitigates the "curse of dimensionality", makes the outlier measure independent of the dimensionality of the relevant subspace, and allows data objects in any relevant subspace to be measured. Otherwise, the data object's local density difference is set to 0, marking it as normal data. The top N data objects with the largest local density difference (degree of outlierness) are selected as local outliers. Experiments on two real-world datasets, UCI and stellar spectral data, validate the effectiveness of the algorithm.
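The pipeline described in the abstract can be illustrated with a rough sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration: the "sparsity factor" is taken to be a per-attribute variance, the subspace definition vector marks an attribute as relevant when the local-to-global variance ratio falls below a hypothetical `threshold`, and the density difference is an error-function transform of a dimension-normalized deviation from the neighbors. The function names are likewise hypothetical and are not the authors' implementation.

```python
import math

def knn_indices(data, i, k):
    """Indices of the k nearest neighbours of data[i] (Euclidean distance)."""
    dists = sorted((math.dist(data[i], p), j) for j, p in enumerate(data) if j != i)
    return [j for _, j in dists[:k]]

def variance(values):
    """Population variance, used here as an assumed stand-in for a sparsity factor."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def local_outlier_scores(data, k=3, threshold=0.5):
    """Score each object in [0, 1]; 0 means no relevant subspace (treated as normal)."""
    d = len(data[0])
    global_var = [variance([p[a] for p in data]) for a in range(d)]
    scores = []
    for i, point in enumerate(data):
        nbrs = knn_indices(data, i, k)
        # Assumed subspace definition vector: keep attributes whose local
        # spread is clearly smaller than the global spread.
        dims = []
        for a in range(d):
            local_var = variance([data[j][a] for j in nbrs])
            ratio = local_var / global_var[a] if global_var[a] > 0 else 1.0
            if ratio < threshold:
                dims.append(a)
        if not dims:
            scores.append(0.0)  # no relevant subspace: density difference set to 0
            continue
        # Squared standardized deviation from the neighbours, averaged over the
        # relevant attributes so the score does not grow with subspace size.
        z2 = 0.0
        for a in dims:
            vals = [data[j][a] for j in nbrs]
            mu = sum(vals) / len(vals)
            sd = math.sqrt(variance(vals)) or 1e-12  # guard against zero spread
            z2 += ((point[a] - mu) / sd) ** 2
        z = math.sqrt(z2 / len(dims))
        scores.append(math.erf(z / math.sqrt(2)))  # Gaussian error function
    return scores

def top_n_outliers(scores, n):
    """Indices of the n objects with the largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
```

Calling `top_n_outliers(local_outlier_scores(data, k=3), n)` returns the indices of the `n` highest-scoring objects. Averaging the squared deviations over the relevant attributes is one simple way to keep the score from scaling with subspace dimensionality; the paper's own normalization may differ.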
Keywords: local outliers; high-dimensional datasets; local sparsity difference; local density difference; relevant subspace
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]