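Affiliations: [1] School of Computer and Information Technology, Shanxi University, Taiyuan 030006; [2] Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan 030006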
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2014, No. 9, pp. 1961-1966 (6 pages)
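Funding: Supported by the National Natural Science Foundation of China (71031006); the Shanxi Province Science and Technology Infrastructure Platform Construction Project (2012091002-0101); and the Shanxi Province Research Project for Returned Overseas Scholars (2013-101)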
Abstract: When applied to large mixed datasets, existing outlier detection algorithms often suffer from high time cost and poor applicability. To address this problem, this paper proposes an outlier detection algorithm for mixed data based on the nearest-neighbor idea. The algorithm first defines a dissimilarity measure between mixed-data objects in terms of neighborhood counting. It then defines an outlier factor for each object based on its nearest neighbors; outliers are the points with the largest outlier factor values. To further improve efficiency, a parallel version of the algorithm is designed on the MapReduce programming model. Experiments on UCI datasets, with comparisons against existing outlier detection algorithms, show that the proposed algorithm detects outliers effectively and has the merits of few parameters and high detection precision. The results for the parallel implementation show that it improves outlier detection efficiency on large mixed datasets and scales well.
Keywords: outlier detection; mixed data; neighborhood counting; MapReduce
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
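
The pipeline outlined in the abstract (a dissimilarity for mixed data, then a nearest-neighbor outlier factor, then ranking by that factor) can be illustrated with a minimal sketch. This record does not give the paper's actual definitions, so everything below is an assumption: the dissimilarity is a simple stand-in (range-normalized difference for numeric attributes, mismatch for categorical ones) rather than the neighborhood-counting measure, the score is the mean dissimilarity to the k nearest neighbors, and the function names are hypothetical. The MapReduce parallelization is not shown.

```python
import heapq


def dissimilarity(a, b, numeric_idx, ranges):
    """Stand-in mixed-data dissimilarity in [0, 1] (NOT the paper's measure)."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_idx:
            # Numeric attribute: absolute difference scaled by the attribute range.
            total += abs(x - y) / ranges[i] if ranges[i] else 0.0
        else:
            # Categorical attribute: 0 if equal, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(a)


def outlier_scores(data, numeric_idx, k):
    """Assumed outlier score: mean dissimilarity to the k nearest neighbors."""
    ranges = {
        i: max(row[i] for row in data) - min(row[i] for row in data)
        for i in numeric_idx
    }
    scores = []
    for p, row in enumerate(data):
        dists = [
            dissimilarity(row, other, numeric_idx, ranges)
            for q, other in enumerate(data)
            if q != p
        ]
        nearest = heapq.nsmallest(k, dists)
        scores.append(sum(nearest) / len(nearest))
    return scores


def top_outliers(data, numeric_idx, k, n):
    """Return the indices of the n objects with the largest scores."""
    scores = outlier_scores(data, numeric_idx, k)
    return sorted(range(len(data)), key=lambda i: scores[i], reverse=True)[:n]


if __name__ == "__main__":
    # Toy mixed data: one numeric attribute (index 0), one categorical.
    data = [(1.0, "a"), (1.1, "a"), (0.9, "a"), (1.2, "a"), (9.0, "b")]
    print(top_outliers(data, numeric_idx={0}, k=2, n=1))
```

In this toy dataset the last object differs from all others in both attributes, so it receives the largest score. The all-pairs loop is quadratic in the number of objects, which is the kind of cost the abstract's parallel design aims to reduce.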