An Incremental Text Clustering Algorithm Based on Cluster Congruence

Cited by: 2

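Affiliations: [1] College of Computer Information Engineering, Jiangxi Normal University, Nanchang 330022; [2] Network Information Management Center, Jiangxi University of Finance and Economics, Nanchang 330013; [3] College of Elementary Education, Jiangxi Normal University, Nanchang 330027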

Source: Computer Engineering (《计算机工程》), 2014, No. 6, pp. 195-200 (6 pages)

Funding: National Natural Science Foundation of China (61272212)

Abstract: Traditional text clustering methods are suited only to static samples and have high time complexity. To address these problems, this paper proposes an Incremental Text Clustering Algorithm based on Congruence (ITCAC) between texts and clusters. The algorithm uses a text representation model based on the semantic similarity of terms, exploits the semantic information between terms, and clusters texts incrementally by computing the congruence between each new text and the existing clusters. After part of the texts has been processed incrementally, the texts most likely to have been misassigned are reassigned to clusters to further improve clustering performance. When the data keeps growing or being updated, the algorithm avoids a large amount of repeated computation. Experiments on the 20 Newsgroups dataset show that, compared with the k-means and SHC algorithms, the proposed algorithm reduces clustering time and improves clustering performance.

Keywords: text clustering; incremental clustering; semantic similarity; cluster congruence; text reassignment

Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
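
The incremental procedure summarized in the abstract (score each new text against the existing clusters, assign it or open a new cluster, then revisit earlier assignments) can be illustrated with a minimal Python sketch. The abstract does not give the paper's congruence formula or its term-semantic representation, so this sketch substitutes plain cosine similarity between a bag-of-words vector and the summed vector of a cluster. The names `IncrementalClusterer`, `threshold`, `add`, and `reassign` are hypothetical, and the reassignment pass checks every text rather than only those most likely to be misassigned.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count mappings."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class IncrementalClusterer:
    """Illustrative stand-in, NOT the ITCAC algorithm from the paper."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed cut-off for opening a new cluster
        self.sums = []              # summed term vector of each cluster
        self.members = []           # document vectors held by each cluster

    def _score(self, vec, idx):
        # Assumption: cosine to the cluster's summed vector replaces congruence.
        return cosine(vec, self.sums[idx])

    def add(self, tokens):
        """Assign one new text without re-clustering the old ones."""
        vec = Counter(tokens)
        scores = [self._score(vec, i) for i in range(len(self.sums))]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < self.threshold:
            self.sums.append(Counter(vec))
            self.members.append([vec])
            return len(self.sums) - 1
        self.sums[best].update(vec)
        self.members[best].append(vec)
        return best

    def reassign(self):
        """Move texts that now fit another cluster better; return the count."""
        moved = 0
        for src in range(len(self.members)):
            for vec in list(self.members[src]):
                self.sums[src].subtract(vec)  # score without the text itself
                scores = [self._score(vec, i) for i in range(len(self.sums))]
                dst = max(range(len(scores)), key=scores.__getitem__)
                if scores[dst] <= scores[src] or scores[dst] < self.threshold:
                    dst = src                 # keep the current assignment
                self.sums[dst].update(vec)
                if dst != src:
                    self.members[src].remove(vec)
                    self.members[dst].append(vec)
                    moved += 1
        return moved


clusterer = IncrementalClusterer(threshold=0.3)
docs = [
    ["stock", "market", "price"],
    ["market", "price", "trade"],
    ["football", "match", "goal"],
    ["goal", "match", "team"],
]
print([clusterer.add(d) for d in docs])  # prints [0, 0, 1, 1]
```

Each call to `add` touches only the per-cluster summaries, which is where the saving over re-running a batch method such as k-means would come from in this simplified setting.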
