Authors: WANG Xiao-hui [1], SONG Xue-kun [1], WANG Xiao-chuan [2]
Affiliations: [1] Henan University of Chinese Medicine, Zhengzhou, Henan 450046, China; [2] Zhengzhou University, Zhengzhou, Henan 450001, China
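Source: Computer Simulation (《计算机仿真》), 2021, No. 7, pp. 281-285 (5 pages)
Funding: National Natural Science Foundation of China, Youth Program (61702164, 81703946); Henan Province Science and Technology Research Program (172102310535); Training Program for Young Backbone Teachers in Henan Higher Education Institutions (2020GGJS104).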
Abstract: As data sets grow in size, dimensionality, and complexity, mining their outliers becomes increasingly difficult. To address this, a local outlier mining algorithm based on neighborhood density is proposed. First, the high-dimensional data are partitioned into regions according to the computing performance of the nodes, and the quality of the partition is evaluated from the data distribution in each dimension. Next, kernel density is used to describe local density: the occurrence count of the data is obtained from a Gaussian distribution, and the neighborhood density of the data is then computed. The outlier degree of each data point is calculated from the relationship between neighborhood and density, which identifies the outliers in heterogeneous data. Finally, to handle possible misjudged outliers, an outlier score is computed, and weight-based pruning is applied to improve detection performance at this step. Simulation results on synthetic and UCI data sets show that the accuracy of outlier mining is almost unaffected when the data volume and dimensionality change, and that mining time and coverage are significantly better than those of other methods. For heterogeneous data of different types and complexity, the algorithm also maintains good mining accuracy and efficiency.
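Keywords: outlier mining; region partitioning; neighborhood density; heterogeneous data; outlier score
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]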
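The density step summarized in the abstract can be illustrated with a generic kernel-density outlier score. The sketch below is not the authors' algorithm: the record gives no formulas, so the function name `neighborhood_density_scores`, the parameters `k` and `bandwidth`, and the ratio-based score are all assumptions chosen for illustration, and the region partitioning and weight-based pruning steps are omitted. It estimates each point's local density with a Gaussian kernel over its k nearest neighbors, then compares that density with the mean density of those neighbors.

```python
import numpy as np


def neighborhood_density_scores(X, k=5, bandwidth=1.0):
    """Hypothetical sketch: score each row of X by how sparse its
    neighborhood is relative to its neighbors' neighborhoods.
    Scores near 1 suggest an inlier; much larger scores suggest an outlier."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    # Pairwise Euclidean distances (O(n^2) memory; fine for small data only).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor

    # Indices and distances of the k nearest neighbors of every point.
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dist, nn_idx, axis=1)

    # Local density: mean Gaussian kernel value over the k neighbors
    # (assumed form; the paper's exact density definition is not given here).
    density = np.exp(-0.5 * (nn_dist / bandwidth) ** 2).mean(axis=1)
    density = np.maximum(density, 1e-12)  # avoid division by zero

    # Outlier score: neighbors' mean density divided by the point's own density.
    return density[nn_idx].mean(axis=1) / density


if __name__ == "__main__":
    # A 3x3 grid of points plus one distant point at index 9.
    grid = [(x * 0.5, y * 0.5) for x in range(3) for y in range(3)]
    data = np.array(grid + [(10.0, 10.0)])
    scores = neighborhood_density_scores(data, k=3, bandwidth=1.0)
    print(int(np.argmax(scores)))  # index of the highest-scoring point
```

The ratio form is borrowed from density-ratio detectors such as LOF, since a point in a sparse region surrounded by dense neighbors gets a large score. The pairwise distance matrix limits this sketch to small data sets, which is the scaling problem that the partitioning step described in the abstract is meant to address.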