Incremental Algorithm of Text Soft Clustering (一种增量式文本软聚类算法)  Cited by: 3

Authors: Feng Zhonghui (冯中慧)[1], Bao Junpeng (鲍军鹏)[1], Shen Junyi (沈钧毅)[1]

Affiliation: [1] School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049

Source: Journal of Xi'an Jiaotong University (《西安交通大学学报》), 2007, No. 4, pp. 398-401, 411 (5 pages)

Funding: National Natural Science Foundation of China (Grant No. 60673087)

Abstract: Traditional text clustering algorithms have high time complexity, distance-independent algorithms are unsuitable for dynamic, changing corpora, and traditional algorithms cannot account for the multi-topic nature of a single text. To address these problems, an incremental text soft clustering algorithm based on semantic sequences is proposed. The algorithm takes the multi-topic nature of long texts into account. It uses the similarity relation between semantic sequences to compute the coverage of sets of similar semantic sequences, and at each step it selects the candidate cluster with the minimum entropy overlap value as a result cluster. This greatly reduces the dimensionality of the text vector space throughout the clustering process and shortens the computing time. Because the semantic sequences depend only on the text itself, the algorithm is suitable for incremental clustering. Experimental results show that its clustering precision is higher than that of other clustering algorithms under the same conditions, and that it is especially well suited to soft clustering of long-text collections.
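
The selection step mentioned in the abstract (repeatedly taking the candidate cluster with the minimum entropy overlap) can be illustrated with a small sketch. The abstract does not give the formula, so this example assumes the entropy overlap definition commonly used in frequent-term-based clustering: for each document in a candidate, add -(1/f)·ln(1/f), where f is the number of candidates containing that document. The function names, the dictionary input format, and the stopping rule are hypothetical and chosen only for illustration. This is not the authors' implementation, and the semantic sequence extraction itself is not modelled.

```python
import math

def entropy_overlap(cluster, candidates):
    """Assumed definition: sum over documents of -(1/f) * ln(1/f),
    where f counts the candidates that contain the document."""
    total = 0.0
    for doc in cluster:
        f = sum(1 for docs in candidates.values() if doc in docs)
        total += -(1.0 / f) * math.log(1.0 / f)
    return total

def greedy_soft_clusters(candidates, all_docs):
    """Pick candidates in order of minimum entropy overlap until every
    document is covered. Documents are NOT removed from the other
    candidates, so one document may end up in several clusters (soft)."""
    remaining = dict(candidates)
    uncovered = set(all_docs)
    result = []
    while uncovered and remaining:
        # Only candidates that still cover something new are worth picking.
        useful = {k: v for k, v in remaining.items() if v & uncovered}
        if not useful:
            break
        # Ties are broken by candidate name to keep the result deterministic.
        best = min(useful, key=lambda k: (entropy_overlap(useful[k], remaining), k))
        result.append(best)
        uncovered -= remaining.pop(best)
    return result

# Hypothetical candidates: label -> set of document ids.
candidates = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
selected = greedy_soft_clusters(candidates, {1, 2, 3, 4})
```

In this toy input, candidate "c" shares no documents with the others, so its overlap is zero and it is selected first. Document 2 stays in both "a" and "b", which is what makes the result a soft clustering.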

Keywords: semantic sequence; incremental clustering; soft clustering; text clustering

Classification code: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]

 
