Improved Canopy-Kmeans Algorithm Based on Double-MapReduce  Cited by: 6

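Authors: Liu Baolong[1], Su Jin[1]

Affiliation: [1] School of Computer Science and Engineering, Xi'an Technological University, Xi'an 710021, China

Source: Journal of Xi'an Technological University, 2016, No. 9, pp. 730-737 (8 pages)

Funding: Shaanxi Province Science and Technology Coordination and Innovation Project (2015KTCXSF-10-11); Xi'an Weiyang District Science and Technology Program (201609)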

Abstract: The traditional Canopy-Kmeans algorithm selects its center points with considerable randomness, and redundant computation during iteration lowers its running efficiency. Based on the "min-max principle" and the triangle inequality, this paper proposes an improved Canopy-Kmeans algorithm on the Hadoop platform, built on a double MapReduce design. Experimental results show that the precision of the proposed parallel algorithm improves by 15.3% on average across datasets of different sizes, and that speedup and scalability improve by a factor of 1.5 to 3 as the data size and the number of nodes increase. The approach addresses the problem in Canopy center point selection and avoids redundant distance calculations during iteration.
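The abstract names two ideas without giving their details, so the following is only a minimal single-machine sketch under stated assumptions, not the authors' Hadoop implementation. It reads the "min-max principle" as farthest-first seeding (each new center maximizes its minimum distance to the centers already chosen) and uses the triangle inequality in the common form "if d(c, c') >= 2 d(x, c), then c' cannot be closer to x than c". The function names `minmax_seeds` and `assign_with_pruning` are hypothetical, as is the choice of the first point as the initial seed.

```python
import math


def minmax_seeds(points, k):
    """Pick k seeds; each new seed maximizes its minimum distance to chosen seeds."""
    seeds = [points[0]]  # assumption: start from the first point
    # min_d[i] = distance from points[i] to its nearest chosen seed
    min_d = [math.dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        idx = max(range(len(points)), key=lambda i: min_d[i])
        seeds.append(points[idx])
        min_d = [min(min_d[i], math.dist(points[i], points[idx]))
                 for i in range(len(points))]
    return seeds


def assign_with_pruning(points, centers):
    """Assign points to nearest centers, skipping distances ruled out by the triangle inequality."""
    k = len(centers)
    # Center-to-center distances are computed once and reused for every point.
    cc = [[math.dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, math.dist(p, centers[0])
        for j in range(1, k):
            # d(p, c_j) >= d(c_best, c_j) - d(p, c_best) >= d(p, c_best), so c_j cannot win.
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = math.dist(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 20)]
    centers = minmax_seeds(pts, 3)
    labels, skipped = assign_with_pruning(pts, centers)
    print(centers, labels, skipped)
```

In a MapReduce setting, one plausible split would run the seeding in a first job and the pruned assignment inside the mappers of a second job, but the abstract does not confirm how the paper divides the work.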

Keywords: Canopy-Kmeans; redundant computation; Hadoop platform; Double-MapReduce

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]

 
