Local Entropy Based Weighted Subspace Outlier Mining Algorithm  (Cited by: 28)


Authors: Ni Weiwei (倪巍伟) [1], Chen Geng (陈耿) [2], Lu Jieping (陆介平), Wu Yingjie (吴英杰) [1], Sun Zhihui (孙志挥) [1]

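Affiliations: [1] School of Computer Science and Engineering, Southeast University, Nanjing 210096; [2] Audit Information Engineering Laboratory, Nanjing Audit University (南京审计学院), Nanjing 210029; [3] Zhenjiang Science and Technology Bureau, Zhenjiang, Jiangsu 212002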

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2008, No. 7, pp. 1189-1194 (6 pages)

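Funding: Natural Science Foundation of Jiangsu Province (BK2006095); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20040286009)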

Abstract: Outlier detection is an important research direction in data mining. Its goal is to find, within a large data set, the small number of data objects that differ markedly from the majority. As dimensionality increases, the data exhibit unusual properties, for example in their spatial distribution, and distances measured over the full attribute space are no longer meaningful. This phenomenon, known as the "curse of dimensionality", makes many existing outlier detection algorithms ineffective on high-dimensional data. To address this problem, a local entropy based weighted subspace outlier detection algorithm, SPOD, is proposed. SPOD analyzes the information entropy of each attribute over the neighborhood of each data object and from this generates the object's outlier subspace and attribute weight vector. The outlier attributes that make up an object's outlier subspace are assigned larger weights. Concepts such as the subspace weighted distance are then introduced so that a density-based outlier analysis can be run on the data set, yielding a subspace outlier influence factor for each data object. The larger this factor, the more likely the object is to be an outlier. Theoretical analysis and experimental results indicate that SPOD is well suited to outlier detection in high-dimensional data and is both efficient and effective.
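The pipeline outlined in the abstract (per-attribute neighborhood entropy, a weight vector, a weighted distance, then a density-style score) can be illustrated with a minimal Python sketch. Everything specific here is an assumption, because the abstract does not give the paper's definitions: the equal-width binning, the rule that attributes with above-average neighborhood entropy form the outlier subspace, the `boost` value, and the simplified LOF-like ratio returned by `outlier_scores` are all hypothetical stand-ins, not SPOD's actual subspace outlier influence factor.

```python
import math

def shannon_entropy(values, bins=4):
    """Entropy in bits of a 1-D sample after equal-width binning (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def weighted_distance(p, q, w):
    """Weighted Euclidean distance; w holds one non-negative weight per attribute."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, p, q)))

def k_nearest(data, i, k, weights=None):
    """Indices of the k nearest neighbours of data[i] under the weighted distance."""
    w = weights or [1.0] * len(data[i])
    dists = sorted(
        (weighted_distance(data[i], data[j], w), j)
        for j in range(len(data)) if j != i
    )
    return [j for _, j in dists[:k]]

def attribute_weights(data, i, k, boost=2.0):
    """Hypothetical rule: attributes whose neighbourhood entropy exceeds the
    mean entropy are treated as the outlier subspace and get weight `boost`."""
    nbrs = k_nearest(data, i, k) + [i]
    dims = len(data[i])
    ent = [shannon_entropy([data[j][a] for j in nbrs]) for a in range(dims)]
    mean_ent = sum(ent) / dims
    return [boost if e > mean_ent else 1.0 for e in ent]

def outlier_scores(data, k=3):
    """Simplified density ratio (not the paper's factor): an object's mean
    weighted k-NN distance divided by the mean of its neighbours' values."""
    n = len(data)
    weights = [attribute_weights(data, i, k) for i in range(n)]
    nbrs, avg_dist = [], []
    for i in range(n):
        nn = k_nearest(data, i, k, weights[i])
        nbrs.append(nn)
        avg_dist.append(
            sum(weighted_distance(data[i], data[j], weights[i]) for j in nn) / k
        )
    scores = []
    for i in range(n):
        ref = sum(avg_dist[j] for j in nbrs[i]) / k
        if ref > 0:
            scores.append(avg_dist[i] / ref)
        else:
            scores.append(1.0 if avg_dist[i] == 0 else float("inf"))
    return scores
```

In this sketch a score near 1 means an object is about as dense as its neighbors, and a much larger score marks a candidate outlier. The nested loops make it quadratic in the number of objects, so it is suitable only for small illustrative data sets.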

Keywords: high-dimensional data; outlier detection; information entropy; subspace mining; weight vector

Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
