Outlier mining algorithm for high-dimensional categorical data streams based on spectral clustering  Cited by: 10


Authors: KANG Yao-long; FENG Li-lu [2]; ZHANG Jing-an [3]; CHEN Fu

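Affiliations: [1] School of Computer and Network Engineering, Shanxi Datong University, Datong 037009, Shanxi, China; [2] School of Education Science and Technology, Shanxi Datong University, Datong 037009, Shanxi, China; [3] Computer Network Center, Shanxi Datong University, Datong 037009, Shanxi, China; [4] School of Mathematics and Statistics, Shanxi Datong University, Datong 037009, Shanxi, China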

Source: Journal of Jilin University: Engineering and Technology Edition, 2022, No. 6, pp. 1422-1427 (6 pages)

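Funding: National Natural Science Foundation of China (61803241); Datong Platform and Base Program Project (2020196); Shanxi Academy of Social Sciences (Development Research Center of Shanxi Provincial People's Government) 2021 General Planning Project (YWYB202153).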

Abstract: To detect abnormal data in data streams promptly and reduce potential network threats, an outlier mining algorithm for high-dimensional categorical-attribute data streams based on spectral clustering is proposed. The ordered, high-speed, and high-dimensional characteristics of data streams are analyzed, and the main sources of outliers are examined. Using an attribute weight quantization method together with information entropy, strongly correlated data streams are merged, and the dimensionality of the data stream is then reduced to limit interference. The spectral clustering algorithm is used to set the key scale parameter; the distance between sample and target is computed through the affinity matrix, spectral clustering is transformed into an undirected graph partitioning problem, the feature matrix is obtained, and the salient outlier features are extracted. Using distance-based mining, data blocks are added to the data stream, the probability distributions of two adjacent data blocks are assessed, a sliding window is set, and the distance between the data and the sliding window is obtained and compared with a set threshold; outliers are then added to the outlier set to complete the mining. Simulation results show that, for data streams of different sizes and of different dimensionalities, the execution time required by the algorithm is within 42 s and 40 s respectively. The algorithm scales well with data stream size and dimensionality, and the mined outliers are consistent with the actual ones.

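Keywords: computer application; spectral clustering algorithm; high-dimensional categorical attributes; data stream; outlier mining; sliding window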

Classification number: TP274 [Automation and Computer Technology: Detection Technology and Automatic Devices]
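
The abstract names two steps that can be illustrated compactly: weighting categorical attributes by information entropy, and flagging records whose distance to a sliding window exceeds a threshold. The following is a minimal sketch under stated assumptions, not the authors' implementation. The entropy-share weighting, the weighted mismatch distance, the mean-distance rule, and all function names (`attribute_entropy`, `entropy_weights`, `weighted_mismatch`, `mine_outliers`) are hypothetical choices made for illustration. The spectral clustering stage and the block-wise probability comparison are omitted.

```python
import math
from collections import Counter, deque


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_weights(records):
    """Weight each attribute by its share of the total entropy (assumed scheme)."""
    columns = list(zip(*records))
    entropies = [attribute_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant: fall back to uniform weights.
        return [1.0 / len(columns)] * len(columns)
    return [h / total for h in entropies]


def weighted_mismatch(a, b, weights):
    """Weighted Hamming distance between two categorical records."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def mine_outliers(stream, window_size, threshold, weights):
    """Return indices of records whose mean distance to the window exceeds threshold."""
    window = deque(maxlen=window_size)
    outliers = []
    for index, record in enumerate(stream):
        if len(window) == window_size:
            mean_dist = sum(
                weighted_mismatch(record, ref, weights) for ref in window
            ) / window_size
            if mean_dist > threshold:
                outliers.append(index)
                continue  # keep flagged records out of the reference window
        window.append(record)
    return outliers


# Toy stream: ten records with one deviating record at index 6.
stream = [("tcp", "http", "ok")] * 6 + [("udp", "dns", "fail")] + [("tcp", "http", "ok")] * 3
weights = entropy_weights(stream)
print(mine_outliers(stream, window_size=4, threshold=0.5, weights=weights))  # prints [6]
```

In this sketch the window only starts judging records once it is full, and flagged records are not added to it, so a burst of anomalies does not immediately become the new reference. Both are design choices of the sketch rather than details stated in the abstract.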

 
