Parallel Density-Based Clustering Algorithm by Using Weighted Grid and Information Entropy

Cited by: 10


Authors: HU Jian [1,2], XU Kaibin, MAO Yimin

Affiliations: [1] School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China; [2] Department of Information Engineering, College of Applied Science, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China

Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), 2020, No. 12, pp. 2094-2107 (14 pages)

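Funding: National Key Research and Development Program of China, No. 2018YFC1504705; National Natural Science Foundation of China, No. 41562019; Science and Technology Project of the Jiangxi Provincial Department of Education, Nos. GJJ151528, GJJ151531.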

Abstract: To address unreasonable grid partitioning of data, low clustering accuracy, and low parallelization efficiency in density-based clustering algorithms for big data, this paper proposes DBWGIE-MR, a density-based clustering algorithm using weighted grids and information entropy on MapReduce. First, an adaptive division grid (ADG) strategy is proposed to divide grid cells adaptively. Second, a neighboring expand (NE) strategy is proposed to construct the weighted grid of each data partition; it strengthens the relevance between grids and thereby improves clustering accuracy. Meanwhile, a weighted grid and information entropy (WGIE) strategy is proposed to calculate grid density and to recalculate the ε-neighborhood and core objects of density-based clustering, so that the algorithm suits weighted grids. Then, combined with the MapReduce computing model, the COMCORE-MR algorithm (core clusters computing algorithm based on MapReduce) is proposed to compute local clusters in parallel. Finally, the MECORE-MR algorithm (merge core cluster by using MapReduce), based on disjoint-set and MapReduce, is proposed to speed up the convergence of local cluster merging, which improves the efficiency of merging local clusters in density-based clustering. Experimental results show that DBWGIE-MR achieves better clustering results and better parallel performance on large-scale datasets.
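The abstract names disjoint-set as the basis for merging local clusters but does not give the details of MECORE-MR. The following is only a minimal single-machine sketch of the general idea, under the assumption that two local clusters belong to the same global cluster whenever they share a grid cell. The names `DisjointSet` and `merge_local_clusters`, the cluster identifiers, and the input layout are hypothetical and do not come from the paper; the MapReduce distribution of the work is not modeled.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen elements lazily as singleton sets.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def merge_local_clusters(local_clusters):
    """Merge local clusters that share at least one grid cell.

    local_clusters maps a (hypothetical) local cluster id to the grid
    cells it covers. This sharing criterion is an assumption made for
    illustration, not the paper's stated merge rule.
    """
    ds = DisjointSet()
    owner = {}  # first local cluster seen for each grid cell
    for cid, cells in local_clusters.items():
        ds.find(cid)  # make sure isolated clusters are registered
        for cell in cells:
            if cell in owner:
                ds.union(owner[cell], cid)
            else:
                owner[cell] = cid
    merged = defaultdict(set)
    for cid, cells in local_clusters.items():
        merged[ds.find(cid)].update(cells)
    return sorted(sorted(cells) for cells in merged.values())


if __name__ == "__main__":
    local = {
        "p0-c0": [(0, 0), (0, 1)],
        "p1-c0": [(0, 1), (0, 2)],
        "p1-c1": [(5, 5)],
        "p2-c0": [(0, 2), (1, 2)],
    }
    print(merge_local_clusters(local))
```

In this example the three local clusters chained through cells `(0, 1)` and `(0, 2)` collapse into one global cluster, while `p1-c1` stays separate. With path compression and union by size, each union or find runs in near-constant amortized time, which is one plausible reason a disjoint-set would speed up the merge step.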

Keywords: big data; density-based clustering; weighted grid; information entropy

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
