Study on the optimal number of clusters for semantic-based Chinese text clustering (基于语义的中文文本聚类最佳簇数研究)


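Author: Liu Jinling (刘金岭) [1]

Affiliation: [1] Department of Computer Science, Huaiyin Institute of Technology, Huai'an, Jiangsu 223003, China

Source: Computer Engineering and Design (《计算机工程与设计》), 2010, No. 9, pp. 2034-2036, 2100 (4 pages)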

Abstract: The paper analyzes how the choice of the number of clusters affects clustering results on large-sample data, and examines several prevailing views on indices for measuring clustering quality. Using the concept of text similarity, it studies the problem of finding the optimal number of clusters for text semantics and proposes CTBP, an algorithm that determines the optimal number of text clusters during the clustering process itself. The main idea is to extract one word from each text vector in the text vector set, arrange the extracted items in order of similarity, and partition them incrementally, layer by layer, to obtain the number of clusters that corresponds to the optimal partition. In this way, multiple statistics can be gathered in a single scan of the data, from which the optimal solution is finally computed. Experimental results indicate that the algorithm achieves both high clustering quality and high efficiency.
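The abstract does not give the details of CTBP, so the following is only a minimal sketch of the general idea it describes (order items by similarity, split incrementally layer by layer, keep the cluster count of the best partition). It is not the author's algorithm. The input `scores` is a hypothetical list holding one similarity value per document. The greedy split that most reduces within-cluster squared error and the Calinski-Harabasz index used to compare partitions are substitutions chosen for illustration, as are the names `segment_sse` and `best_cluster_count`.

```python
from itertools import accumulate


def segment_sse(pre, pre2, lo, hi):
    """Sum of squared deviations from the mean for sorted items lo..hi-1,
    computed from prefix sums (pre) and prefix sums of squares (pre2)."""
    n = hi - lo
    s = pre[hi] - pre[lo]
    s2 = pre2[hi] - pre2[lo]
    return max(s2 - s * s / n, 0.0)


def best_cluster_count(scores, max_k=None):
    """Return (k, cuts): the chosen number of clusters and the interior
    cut positions in the sorted order. Illustrative only, not CTBP itself."""
    xs = sorted(scores)  # order items by similarity value
    n = len(xs)
    if n < 3:
        return 1, []
    max_k = min(max_k or n - 1, n - 1)

    # One pass over the data collects the statistics reused by every layer.
    pre = [0.0] + list(accumulate(xs))
    pre2 = [0.0] + list(accumulate(x * x for x in xs))

    total = segment_sse(pre, pre2, 0, n)
    if total <= 1e-12:  # all values (nearly) identical: one cluster
        return 1, []

    cuts = [0, n]
    within = total
    best_k, best_cuts, best_score = 1, [], float("-inf")

    for k in range(2, max_k + 1):
        # Layer k: add the single cut that reduces within-cluster error most.
        best_gain, best_pos = float("-inf"), None
        for lo, hi in zip(cuts, cuts[1:]):
            base = segment_sse(pre, pre2, lo, hi)
            for mid in range(lo + 1, hi):
                gain = (base - segment_sse(pre, pre2, lo, mid)
                        - segment_sse(pre, pre2, mid, hi))
                if gain > best_gain:
                    best_gain, best_pos = gain, mid
        if best_pos is None:
            break
        cuts = sorted(cuts + [best_pos])
        within = max(within - best_gain, 1e-12)
        between = total - within

        # Calinski-Harabasz index (assumed criterion): larger is better.
        score = (between / (k - 1)) / (within / (n - k))
        if score > best_score:
            best_k, best_cuts, best_score = k, cuts[1:-1], score

    return best_k, best_cuts
```

For example, calling `best_cluster_count` on nine values that fall into three tight groups around 0.11, 0.51 and 0.91 selects three clusters. Because the prefix sums are built once, each layer only reads precomputed statistics, which loosely mirrors the abstract's point about gathering statistics in a single scan.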

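Keywords: text clustering; number of clusters; increment; partition; CTBP

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]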

 
