Efficient clustering algorithm for large data sets (面向大数据集的有效聚类算法)  Cited by: 7

Author: Gu Linglan (古凌岚) [1]

Affiliation: [1] Department of Computer Engineering, Guangdong Industry Polytechnic (广东轻工职业技术学院), Guangzhou, Guangdong 510300, China

Source: Computer Engineering and Design (《计算机工程与设计》), 2014, No. 6, pp. 2183-2187 (5 pages)

Abstract: To address the inability of the traditional fuzzy C-means algorithm to cope with the large volume and redundant attributes of large-scale data sets, a hybrid clustering algorithm for large data sets is proposed. The large data set is divided into multiple subsets, each subset is clustered separately, and the final clustering result is obtained by merging the subset results. Each subset is clustered with a hybrid algorithm based on gene expression programming (GEP) and fuzzy C-means, which improves clustering quality and efficiency. Initial cluster centers are selected based on similarity, and information entropy is used to reflect the importance of each attribute, which further optimizes clustering performance. Simulation experiments and analysis show that the algorithm has good global convergence and yields better clustering results.

Keywords: large data sets; fuzzy C-means; gene expression programming; attribute information entropy; clustering

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
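
The divide, cluster, and merge pipeline summarized in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation, and it rests on several assumptions: the GEP optimization step is omitted entirely; attribute importance is approximated with the common entropy weight method; a farthest-first rule stands in for the similarity-based choice of initial centers; and subset results are merged by re-clustering the local centers. All function names (`entropy_weights`, `farthest_first`, `fcm`, `cluster_large`) are hypothetical.

```python
import numpy as np

def entropy_weights(X):
    """Assumed scheme (entropy weight method): low-entropy attributes weigh more."""
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    Z = (X - X.min(axis=0)) / np.where(span > 0, span, 1.0) + 1e-12  # min-max scale
    P = Z / Z.sum(axis=0)                                # per-attribute proportions
    e = -(P * np.log(P)).sum(axis=0) / np.log(n)         # normalized entropy in [0, 1]
    w = np.clip(1.0 - e, 0.0, None)
    return w / w.sum() if w.sum() > 0 else np.full(d, 1.0 / d)

def farthest_first(X, k):
    """Stand-in for similarity-based initialization: pick mutually distant points."""
    idx = [0]
    d2 = ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        idx.append(int(d2.argmax()))
        d2 = np.minimum(d2, ((X - X[idx[-1]]) ** 2).sum(axis=1))
    return X[idx].copy()

def fcm(X, k, m=2.0, iters=100, tol=1e-6):
    """Plain fuzzy C-means; returns (centers, membership matrix U)."""
    centers = farthest_first(X, k)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)         # each row sums to 1
        Um = U ** m
        new_centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        if shift < tol:
            break
    return centers, U

def cluster_large(X, k, n_subsets=4, seed=0):
    """Split the data, run FCM per subset, then merge the local centers."""
    rng = np.random.default_rng(seed)
    Xw = X * np.sqrt(entropy_weights(X))                 # entropy-weighted distance
    parts = np.array_split(rng.permutation(len(X)), n_subsets)
    local = np.vstack([fcm(Xw[idx], k)[0] for idx in parts])
    centers, _ = fcm(local, k)                           # assumed merge step
    d2 = ((Xw[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centers                    # hard labels, merged centers
```

Calling `cluster_large(X, k=3)` on an `(n, d)` array returns one hard label per row plus the merged centers, which live in the entropy-weighted attribute space. A constant (fully redundant) attribute receives a weight near zero under this scheme and so contributes nothing to the distances.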

 
