Spark Based Condensed Nearest Neighbor Algorithm (基于Spark的压缩近邻算法)  Cited by: 2

Authors: ZHANG Su-fang (张素芳) [1], ZHAI Jun-hai (翟俊海) [2], WANG Ting-ting (王婷婷) [2], HAO Pu (郝璞) [2], WANG Cong (王聪) [2], ZHAO Chun-ling (赵春玲) [2]

Affiliations: [1] Hebei Branch of China Meteorological Administration Training Centre, China Meteorological Administration, Baoding, Hebei 071000, China; [2] Key Lab. of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China

Source: Computer Science (《计算机科学》), 2018, Issue B06, pp. 406-410 (5 pages)

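Funding: Supported by the National Natural Science Foundation of China (71371063), the Natural Science Foundation of Hebei Province (F2017201026), the Natural Science Research Program of Hebei University (799207217071), and the Undergraduate Innovation Training Program of Hebei University (2017071).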

Abstract: K-nearest neighbors (K-NN) is a lazy learning algorithm: classifying data with K-NN requires no trained classification model. K-NN is conceptually simple and easy to implement. Its drawback is a high computational cost, because classifying a test instance requires computing the distance between that instance and every instance in the training set. The condensed nearest neighbor algorithm (CNN) can overcome this drawback. However, CNN is an iterative algorithm, and its efficiency becomes very low on big data sets. To address this problem, this paper proposes a condensed nearest neighbor algorithm named Spark CNN. In big data settings, Spark CNN is substantially more efficient than the MapReduce-based CNN algorithm, a conclusion supported by experiments on 5 big data sets.
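
For readers unfamiliar with the condensed nearest neighbor idea, the following is a minimal single-machine sketch of a Hart-style condensation loop in plain Python. It is not the Spark CNN algorithm described in the paper, whose details are not given in this record; the function names `nearest_label` and `condense`, the Euclidean distance, and the choice of the first instance as the seed are all assumptions made for illustration. The repeated passes over the training set show why the procedure is iterative.

```python
import math


def nearest_label(store, x):
    """Return the label of the stored instance closest to x (1-NN, Euclidean)."""
    best_label, best_dist = None, math.inf
    for features, label in store:
        d = math.dist(features, x)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


def condense(train):
    """Hart-style condensation (illustrative, hypothetical helper).

    train is a list of (features, label) pairs. An instance is added to the
    subset only when the current subset misclassifies it. Passes repeat until
    a full pass adds nothing, so the cost grows with the number of passes.
    """
    if not train:
        return []
    store = [train[0]]               # assumption: seed with the first instance
    in_store = [False] * len(train)
    in_store[0] = True
    changed = True
    while changed:
        changed = False
        for i, (features, label) in enumerate(train):
            if in_store[i]:
                continue
            if nearest_label(store, features) != label:
                store.append((features, label))
                in_store[i] = True
                changed = True
    return store
```

When the loop terminates, every training instance is classified correctly by its nearest neighbor in the returned subset, so later queries only need distances to that smaller subset rather than to the full training set.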

Keywords: condensed nearest neighbor; big data; instance selection; iterative computation; lazy learning

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
