Authors: WANG Tingting[1,2]; ZHAI Junhai[1,2]; ZHANG Mingyang[1,2]; HAO Pu[1,2]
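Affiliations: [1] Hebei Province Key Lab. of Machine Learning and Computational Intelligence, Hebei University, Baoding 071002, Hebei, China; [2] College of Mathematics and Information Science, Hebei University, Baoding 071002, Hebei, China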
Source: Journal of Shandong University (Engineering Science), 2018, No. 3, pp. 54-59 (6 pages)
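Funding: Natural Science Foundation of Hebei Province (F2017201026); Hebei University Natural Science Research Program (799207217071); Hebei University Graduate Innovation Funding Program (X2016059)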
Abstract: To address the high computational complexity of the K-nearest neighbors (K-NN) algorithm on big data, a K-NN classification algorithm for big data based on HBase and SimHash is proposed. SimHash maps the data set from the original space into Hamming space, yielding a set of hash signatures. Each instance is then stored in an HBase database as a (rowkey, value) pair, where the rowkey is the instance's hash signature and the value is its class label. For a test instance, its hash signature is used as the rowkey to retrieve from HBase the values of all stored instances that share that rowkey, and the class of the test instance is obtained by majority voting over those values. The proposed algorithm was compared experimentally with MapReduce-based K-NN and Spark-based K-NN in terms of running time and test accuracy. The results show that, while preserving classification performance, the proposed algorithm runs in far less time than the other two methods.
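The pipeline described in the abstract (hash each instance, store the signature as a key with the class as its value, then vote over the values found under the test instance's key) can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: the signature is computed with the random-hyperplane form of SimHash for numeric vectors, a plain in-memory dictionary stands in for the HBase table, and all function names and parameters (`make_hyperplanes`, `simhash_signature`, `build_table`, `classify`, `n_bits`) are hypothetical. It is not the authors' implementation.

```python
import random
from collections import Counter, defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (assumed SimHash variant)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def simhash_signature(x, hyperplanes):
    """Map a numeric vector to a bit string: one bit per hyperplane side."""
    bits = []
    for h in hyperplanes:
        dot = sum(hi * xi for hi, xi in zip(h, x))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def build_table(samples, labels, hyperplanes):
    """Store (signature -> class labels); the dict stands in for an HBase table,
    with the signature playing the role of the rowkey."""
    table = defaultdict(list)
    for x, y in zip(samples, labels):
        table[simhash_signature(x, hyperplanes)].append(y)
    return table


def classify(x, table, hyperplanes):
    """Look up the test instance's signature and majority-vote over the values.
    Returns None when no stored instance shares the signature."""
    values = table.get(simhash_signature(x, hyperplanes), [])
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]
```

A caller would build the hyperplanes once, fill the table from the training set with `build_table`, and then call `classify` per test instance. How an empty lookup should be handled (returned as `None` here) is a design choice of this sketch; the abstract does not say.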
Keywords: big data; K-nearest neighbors; classification algorithm; HBase; SimHash
Classification number: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]