An improved k-means algorithm for text clustering

Cited by: 5

Affiliation: [1] School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, Guangxi

Source: Journal of Guilin University of Electronic Technology, 2016, No. 4, pp. 311-314 (4 pages)

Funding: National 863 Program (2012AA011005)

Abstract: In text clustering, k-means selects its initial cluster centroids at random, so the result can become trapped in a local optimum; outliers and an undetermined number of clusters further lower accuracy and slow convergence. To address these problems, an improved k-means algorithm for text clustering is proposed. The algorithm uses FP-growth to mine frequent itemsets from the text, filters them to obtain the core frequent itemsets, and uses the core frequent itemsets to generate the initial cluster centroids and the number of clusters. K-means then clusters the text starting from these centroids and this cluster count. Text clustering experiments on a Sina Weibo dataset show that the improved k-means algorithm increases clustering accuracy, accelerates convergence, and is robust.
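The pipeline described in the abstract can be sketched roughly as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: a brute-force itemset enumeration stands in for FP-growth (it yields the same frequent itemsets, without the FP-tree), the "core" filter is assumed to keep only maximal frequent itemsets (the abstract does not state the filtering rule), documents are represented as plain term-count vectors, and each initial centroid is taken as the mean of the documents containing a core itemset. All function names and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=3):
    """Brute-force stand-in for FP-growth: itemsets of up to max_size terms
    that occur in at least min_support documents."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc))
        for size in range(1, max_size + 1):
            counts.update(combinations(terms, size))
    return {frozenset(s): c for s, c in counts.items() if c >= min_support}


def core_itemsets(itemsets):
    """Assumed filter: keep itemsets that have no frequent proper superset."""
    return [s for s in itemsets if not any(s < t for t in itemsets)]


def vectorize(docs):
    """Term-count vectors over the sorted vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0.0] * len(vocab)
        for t in d:
            v[index[t]] += 1.0
        vectors.append(v)
    return vectors


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seed_centroids(docs, vectors, cores):
    """One seed per core itemset: mean of the documents that contain it."""
    return [mean([v for d, v in zip(docs, vectors) if s <= set(d)])
            for s in cores]


def kmeans(vectors, seeds, max_iter=100):
    """Standard k-means started from the given seeds; k = len(seeds)."""
    centroids = [list(c) for c in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[j] = mean(members)
    return labels


# Toy, made-up documents (already tokenized) on two topics.
docs = [
    ["phone", "battery", "screen"],
    ["phone", "battery", "price"],
    ["phone", "screen", "battery"],
    ["match", "goal", "team"],
    ["team", "goal", "coach"],
    ["goal", "team", "match"],
]
cores = core_itemsets(frequent_itemsets(docs, min_support=3))
vectors = vectorize(docs)
labels = kmeans(vectors, seed_centroids(docs, vectors, cores))
```

In this sketch the number of clusters is simply the number of core itemsets, so k is never chosen by hand, and the seeds are placed in dense regions of the data instead of being drawn at random.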

Keywords: text clustering; FP-growth; k-means

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
