Two-Stage Parallel Fuzzy c-Means Data Clustering Algorithm Based on Hadoop  Cited by: 2

Hadoop Secondary Parallel Fuzzy c-Means Clustering Algorithm


Authors: 高献卫 (Gao Xianwei), 师智斌 (Shi Zhibin) [1]

Affiliation: [1] School of Computer and Control Engineering, North University of China (中北大学), Taiyuan 030051

Source: Computer Measurement & Control (《计算机测量与控制》), 2015, No. 3, pp. 842-846 (5 pages)

Funding: National Natural Science Foundation of China (50976108); Natural Science Foundation of Shanxi Province (2012011011-3)

Abstract: Under the MapReduce mechanism, communication time takes up too large a share of an algorithm's running time, which limits its practical value. To address this problem, a two-stage parallel fuzzy c-Means clustering algorithm based on Hadoop is proposed. First, a membership management protocol is used to synchronize membership management with the MapReduce reduce operation, which improves the MPI communication management method under the MapReduce mechanism. Second, reduce operations over groups of typical individuals replace reduce operations over all individuals, and a two-stage buffering algorithm is defined: buffering in the first stage further reduces the volume of data handled by the second-stage MapReduce operation, limiting the negative impact of big data on the algorithm as far as possible. Simulation experiments show that the algorithm performs well when processing big data. Its parallel efficiency and speedup on large-scale datasets are both better than on small datasets, which indicates that the algorithm can adjust itself in real time according to the size of the data.
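The abstract does not spell out the update rules, so the following is only a minimal sketch under stated assumptions: it uses the standard fuzzy c-Means membership and center updates, and it models the first-stage buffering as a local pre-aggregation that collapses each data partition into per-cluster partial sums before a global reduce. The function names (`memberships`, `stage_one`, `stage_two`, `fuzzy_c_means`) and the toy data are hypothetical. The membership management protocol, the MPI communication handling, and the selection of typical individual groups described in the paper are not modeled here.

```python
def memberships(point, centers, m=2.0, eps=1e-12):
    """Standard fuzzy c-Means membership of one point in every cluster."""
    # Euclidean distances, clamped so a point sitting on a center is safe.
    dists = [
        max(sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5, eps)
        for center in centers
    ]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]


def stage_one(partition, centers, m=2.0):
    """Assumed first stage: reduce one partition to per-cluster partial sums."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]  # sum of u^m * x per cluster
    den = [0.0] * len(centers)            # sum of u^m per cluster
    for point in partition:
        for j, u in enumerate(memberships(point, centers, m)):
            w = u ** m
            den[j] += w
            for d in range(dim):
                num[j][d] += w * point[d]
    return num, den


def stage_two(partials):
    """Assumed second stage: merge the small partial sums into new centers."""
    num_total = [list(vec) for vec in partials[0][0]]
    den_total = list(partials[0][1])
    for num, den in partials[1:]:
        for j in range(len(den_total)):
            den_total[j] += den[j]
            for d in range(len(num_total[j])):
                num_total[j][d] += num[j][d]
    return [
        tuple(x / den_total[j] for x in num_total[j])
        for j in range(len(den_total))
    ]


def fuzzy_c_means(partitions, centers, m=2.0, iterations=20):
    """Iterate the two stages; partitions must be non-empty."""
    for _ in range(iterations):
        partials = [stage_one(part, centers, m) for part in partitions]
        centers = stage_two(partials)
    return centers


# Toy data: two partitions, as if held by two worker nodes.
partitions = [
    [(0.0, 0.0), (0.2, 0.0), (5.0, 5.2)],
    [(0.0, 0.2), (5.0, 5.0), (5.2, 5.0)],
]
initial_centers = [(1.0, 1.0), (4.0, 4.0)]
centers = fuzzy_c_means(partitions, initial_centers)
print(centers)
```

Because the partial sums are additive, the second stage only receives one small record per cluster from each partition, regardless of how many points the partition holds. That is the sense in which a first-stage aggregation can cut the data volume reaching the global reduce in this sketch.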

Keywords: two-stage; fuzzy c-Means; big data; data clustering; Hadoop

Classification number: TP312 [Automation and Computer Technology: Computer Software and Theory]

 
