An Outlier Detection Algorithm Based on Fusion Data Self-representation

Cited by: 2


Authors: GAO Ya-xing (高亚星); ZHAO Xu-jun (赵旭俊) [1]; CAO Xu-yang (曹栩阳)

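Affiliation: [1] School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, Shanxi, China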

Source: Computer Technology and Development (《计算机技术与发展》), 2023, No. 12, pp. 41-48 (8 pages)

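Funding: National Natural Science Foundation of China (61572343); National Defense Science and Technology Key Laboratory Fund Project (JSY6142219202114); Shanxi Province Applied Basic Research Program (20210302123223, 202103021224275).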

Abstract: Data self-representation can be used for outlier detection because it amplifies both the differences and the correlations among data points. Existing methods, however, do not account for how correlations among features affect outlier detection, and so do not carry over to high-dimensional data. To address this, the paper proposes an outlier detection algorithm based on fused data self-representation that can effectively detect outliers in high-dimensional data. First, a feature-relationship-based data self-representation method is proposed: it combines mutual information and information entropy to measure the correlation among features of high-dimensional data and integrates this measure into the sparse representation of the data, capturing the complex relationships both among features and among data points. Second, a method for fusing the self-representations of different feature groups is proposed: the self-representation matrices corresponding to the individual feature groups are combined by point multiplication into a single global data self-representation matrix. Finally, the outlier detection algorithm itself is proposed: outliers are detected by a graph random walk on the directed weighted graph induced by the global data self-representation matrix. Experimental results show that the algorithm outperforms the comparison algorithms on both real and synthetic datasets, indicating good generalization and stability.
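The three stages named in the abstract can be sketched roughly as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the abstract gives no formulas, so the feature groups are taken as given (the mutual-information and entropy based grouping is not reproduced), a ridge-regularised least-squares fit stands in for the sparse representation, "point multiplication" is read as an element-wise product of absolute coefficients, and a damped random walk with damping factor 0.85 is used, with a low stationary probability interpreted as "more outlying". All function names and parameters are hypothetical.

```python
import numpy as np


def self_representation(X, lam=0.1):
    """Ridge stand-in for sparse self-representation (assumption).

    Row i of C holds the coefficients that reconstruct sample x_i from
    all other samples; the diagonal stays zero so no sample uses itself.
    """
    n = X.shape[0]
    G = X @ X.T  # Gram matrix between samples
    C = np.zeros((n, n))
    for i in range(n):
        idx = np.delete(np.arange(n), i)  # every sample except x_i
        A = G[np.ix_(idx, idx)] + lam * np.eye(n - 1)
        C[i, idx] = np.linalg.solve(A, G[idx, i])
    return C


def fuse(mats):
    """Element-wise product of |C| over feature groups (assumed reading
    of 'point multiplication'); the result is a non-negative weight matrix."""
    W = np.ones_like(mats[0])
    for C in mats:
        W = W * np.abs(C)
    return W


def random_walk_scores(W, damping=0.85, iters=200):
    """Stationary distribution of a damped random walk on the directed
    weighted graph whose edge i -> j has weight W[i, j]."""
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    safe = np.where(row_sums > 0, row_sums, 1.0)
    # Rows without outgoing weight jump uniformly to any node.
    P = np.where(row_sums > 0, W / safe, 1.0 / n)
    P = damping * P + (1.0 - damping) / n
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ P
    return pi


def outlier_scores(X, groups, lam=0.1):
    """groups: list of column-index lists, one per feature group
    (hypothetical input; the paper derives the grouping itself)."""
    mats = [self_representation(X[:, g], lam) for g in groups]
    return random_walk_scores(fuse(mats))
```

Calling `outlier_scores(X, [[0, 1, 2], [3, 4, 5]])` on an `n x 6` array returns one score per sample; under the assumption above, the samples with the smallest scores are the outlier candidates, since few other samples rely on them for reconstruction in every feature group.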

Keywords: outlier detection; data self-representation; feature grouping; information entropy; random walk

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]

 
