Improvement on FTC text clustering algorithm combined with semantics  Cited by: 5


Authors: Wang Xiuhui (王秀慧)[1], Wang Lizhen (王丽珍)[1], Ma Shufang (麻淑芳)[1]

Affiliation: [1] School of Educational Science and Technology, Shanxi Datong University, Datong, Shanxi 037009, China

Source: Computer Engineering and Design (《计算机工程与设计》), 2014, No. 2, pp. 515-519 (5 pages)

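Funding: Shanxi Province Science and Technology Basic Conditions Platform Fund Project (2011091002-0102); Shanxi Datong University Youth Research Fund Project (2010Q13)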

Abstract: The FTC text clustering algorithm ignores the semantic relations between words and produces only a hard partition. To address both shortcomings, an improved FTC text clustering algorithm combined with semantics, called SFTC, is proposed. First, SFTC uses HowNet to map the keyword set of each document to a concept set, mines frequent itemsets at the concept level with FP-Growth, and treats the cover of each frequent itemset as a candidate cluster. Second, because a document may address multiple topics, a formula for the similarity between clusters is defined. While the final clusters are generated, this similarity is measured to decide whether two clusters should overlap, which gives a partially soft partition of the documents. Experimental results show that SFTC achieves higher clustering accuracy and higher running efficiency.

Keywords: text clustering; frequent itemsets; HowNet; cluster similarity; soft partitioning

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
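
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions: `CONCEPT_OF` is a hypothetical hand-written dictionary standing in for a HowNet lookup, frequent itemsets are found by brute-force enumeration rather than FP-Growth, and `cluster_similarity` uses Jaccard similarity of covers as a stand-in because the abstract does not give the paper's similarity formula. How the similarity value decides whether clusters overlap is also not specified in the abstract, so that step is left out.

```python
from itertools import combinations

# Hypothetical keyword -> concept mapping; a stand-in for a HowNet lookup.
CONCEPT_OF = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "doctor": "medicine", "physician": "medicine", "hospital": "medicine",
    "engine": "machine", "motor": "machine",
}


def to_concepts(keywords):
    """Map a document's keywords to a concept set (unknown words are kept as-is)."""
    return frozenset(CONCEPT_OF.get(word, word) for word in keywords)


def frequent_itemsets(docs, min_support, max_size=2):
    """Return {itemset: cover} for every itemset whose cover has >= min_support docs.

    The cover of an itemset is the set of document ids containing all its items.
    Assumption: brute-force enumeration replaces FP-Growth to keep the sketch short.
    """
    items = sorted(set().union(*docs.values()))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            cover = frozenset(
                doc_id for doc_id, concepts in docs.items()
                if concepts.issuperset(combo)
            )
            if len(cover) >= min_support:
                result[frozenset(combo)] = cover
    return result


def cluster_similarity(cover_a, cover_b):
    """Jaccard similarity of two covers (assumed; not the paper's formula)."""
    union = cover_a | cover_b
    return len(cover_a & cover_b) / len(union) if union else 0.0


# Toy corpus: different surface words collapse onto shared concepts.
raw_docs = {
    "d1": ["car", "engine"],
    "d2": ["automobile", "motor"],
    "d3": ["doctor", "hospital"],
    "d4": ["physician", "truck"],
}
docs = {doc_id: to_concepts(words) for doc_id, words in raw_docs.items()}

# Each frequent concept set's cover is a candidate cluster.
candidates = frequent_itemsets(docs, min_support=2)

vehicle = candidates[frozenset({"vehicle"})]
medicine = candidates[frozenset({"medicine"})]
overlap_score = cluster_similarity(vehicle, medicine)
```

In this toy corpus, `d1` and `d2` share no keywords but map to the same concept set, which is the effect the concept-level mining is meant to capture. Document `d4` falls in the covers of both the `vehicle` and `medicine` candidates, the kind of multi-topic document that a soft partition can keep in more than one cluster.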

 
