Application of Improved K-means Algorithm in Text Mining  Cited by: 9

Authors: YANG Dan (杨丹); ZHU Shi-ling (朱世玲); BIAN Zheng-yu (卞正宇) (School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China)

Affiliation: [1] School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003, China

Source: Computer Technology and Development (《计算机技术与发展》), 2019, No. 4, pp. 68-71 (4 pages)

Funding: National "863" High-Tech Development Program of China (2006AA01Z201)

Abstract: The K-means algorithm is simple and easy to understand, and it is widely used for clustering. However, its initial cluster centers are chosen at random, which can easily make the clustering results unstable. To address the sensitivity of traditional K-means to the choice of initial cluster centers, and the tendency of the max-min distance method to select isolated points (outliers), a new evaluation function for selecting cluster centers is proposed. The function value of each point is examined in turn, and the point with the largest current value is selected as the next cluster center, until the predetermined number of cluster centers is reached. The evaluation function ensures both that the neighborhood of a new center is compact and that the new center lies far from the centers already chosen. Finally, the algorithm is applied to text clustering, and its clustering quality is measured by accuracy, recall, and F-measure. Experimental results show that, compared with the traditional algorithm and the max-min distance algorithm, the proposed algorithm achieves higher accuracy and better clustering quality, and is well suited to text clustering.
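The abstract does not state the evaluation function itself, so the following is only a minimal sketch of the greedy selection loop it describes, under loudly stated assumptions. The score used here (compactness multiplied by separation), the function name `select_initial_centers`, the parameter `n_neighbors`, and the use of Euclidean distance instead of a text distance are all hypothetical stand-ins and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_initial_centers(points, k, n_neighbors=3):
    """Greedily pick k initial center indices.

    ASSUMPTION: the score below is a hypothetical stand-in for the paper's
    evaluation function, which the abstract does not give.
      compactness(i) = 1 / mean distance from i to its nearest neighbors
      separation(i)  = distance from i to the closest already chosen center
      score(i)       = compactness(i) * separation(i)
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Pairwise distance table; O(n^2), acceptable for a small illustration.
    dist = [[euclidean(p, q) for q in points] for p in points]

    # Compactness: points in dense regions get a high value, outliers a low one.
    m = min(n_neighbors, n - 1)
    compactness = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:m]
        mean_d = sum(nearest) / m if m else 0.0
        compactness.append(1.0 / (mean_d + 1e-12))

    # First center: the most compact point (no other centers exist yet).
    centers = [max(range(n), key=lambda i: compactness[i])]

    # Remaining centers: highest score among the points not yet chosen.
    while len(centers) < k:
        def score(i):
            separation = min(dist[i][c] for c in centers)
            return compactness[i] * separation

        candidates = [i for i in range(n) if i not in centers]
        centers.append(max(candidates, key=score))
    return centers


# Two tight groups of four points each, plus one distant outlier at index 8.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (30, 30)]
print(select_initial_centers(points, k=2))
```

With `k=2` on this toy data, the sketch picks one index from each tight group and skips the outlier. A pure max-min distance rule would pick the outlier instead, because it is the point farthest from any first center. The returned indices could then seed a standard K-means run.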

Keywords: K-means algorithm; cluster center; text clustering; text distance; sparsity

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
