An Improved Text Clustering Algorithm Based on Generalized Suffix Trees

Cited by: 7

Source: Information and Control (《信息与控制》), 2009, No. 3, pp. 331-336 (6 pages)

Funding: National Natural Science Foundation of China (60673087; 60377020)

Abstract: This paper analyzes three shortcomings of the basic suffix tree clustering (STC) algorithm: it cannot effectively handle nodes whose document sets differ greatly in size but stand in a containment relation; it cannot effectively handle nodes whose document sets are similar but whose topics differ; and it lacks an effective algorithm for extracting cluster labels. To address these problems, an improved similarity formula for base cluster merging is given that takes into account both topic similarity and the similarity of the contained documents, and a cluster label extraction algorithm based on information gain is proposed. To further improve clustering efficiency, a simple and effective measure for base cluster selection is given, which excludes generalized suffix tree nodes that contribute little to the clustering. Experimental results show that the proposed algorithm not only improves the clustering accuracy of the STC algorithm but also labels the resulting clusters effectively.
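The abstract does not state the improved similarity formula, so the sketch below does not attempt to reproduce it. It only illustrates the first shortcoming, using the document-overlap test commonly described for the basic STC algorithm: two base clusters are merged when their shared documents exceed a threshold fraction of each cluster. The function name `overlap_similar`, the 0.5 threshold, and the sample document ids are assumptions made for illustration.

```python
def overlap_similar(docs_a: set, docs_b: set, threshold: float = 0.5) -> bool:
    """Baseline overlap test (assumed form): merge two base clusters only when
    the shared documents exceed `threshold` of BOTH clusters."""
    if not docs_a or not docs_b:
        return False
    shared = len(docs_a & docs_b)
    return shared / len(docs_a) > threshold and shared / len(docs_b) > threshold


# Hypothetical base clusters, each a set of document ids under one tree node.
large = set(range(10))               # 10 documents
small = {0, 1, 2}                    # fully contained in `large`
peer = {0, 1, 2, 3, 4, 5, 6, 20}     # 8 documents, 7 shared with `large`

# Containment with a large size gap: 3/10 of `large` is shared, so no merge,
# even though every document of `small` also belongs to `large`.
contained_merges = overlap_similar(large, small)

# Similar-sized clusters with heavy overlap: 7/10 and 7/8 both exceed 0.5.
peer_merges = overlap_similar(large, peer)

print(contained_merges, peer_merges)
```

Under this baseline test the contained cluster is left unmerged, which is the kind of case the improved formula is said to handle. The test also looks only at document sets, so it cannot separate nodes with similar documents but different topics.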

Keywords: text clustering; Web mining; generalized suffix tree; suffix tree clustering (STC)

Classification code: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
