Novel Weighted Suffix Tree Clustering for Web Documents  Cited by: 2

Authors: Yang Ruilong[1], Zhu Qingsheng[1], Xie Hongtao[1], Qu Hongchun[1]

Affiliation: [1] College of Computer Science, Chongqing University, Chongqing 400044, China

Source: Journal of System Simulation, 2011, No. 3, pp. 474-479 (6 pages)

Funding: National Science and Technology Support Program of China (2007BAH08B04); Chongqing Science and Technology Support Program (2008AC20084)

Abstract: To account for the structure and characteristics of Web documents, a novel weighted suffix tree clustering method, WSTC, is proposed. First, based on its HTML tags, each Web document is divided into segments with different levels of importance; each segment is split into sentences, and each sentence into words. Second, the suffix tree is built from sentences rather than whole documents, and each sentence's importance level is stored in the tree nodes as a structure weight, yielding a weighted suffix tree model of the document set. Finally, base clusters are selected and merged using the document count, sentence count, phrase length, and structure weight of each internal node. Simulation experiments indicate that WSTC achieves better clustering results on Web documents than the original STC algorithm.
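The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several parts are assumptions made only for illustration: the tag weights in `TAG_WEIGHTS`, the function names, the `max_len` cutoff, and the scoring formula (document count times phrase length times average structure weight). None of these are given in the abstract. A flat phrase index also stands in for a real compacted suffix tree, and the merging of base clusters is omitted.

```python
from collections import defaultdict

# Hypothetical structure weights per HTML tag; the abstract does not give values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "p": 1.0}


def build_phrase_index(docs, max_len=4):
    """Map each phrase to the statistics a weighted suffix tree node would hold.

    docs: {doc_id: [(html_tag, sentence), ...]}
    Every prefix (up to max_len words) of every sentence suffix becomes a key,
    which mimics the nodes of a word-level suffix trie without compaction.
    """
    index = defaultdict(lambda: {"docs": set(), "sentences": set(), "weight": 0.0})
    for doc_id, segments in docs.items():
        for sent_no, (tag, sentence) in enumerate(segments):
            words = sentence.lower().split()
            weight = TAG_WEIGHTS.get(tag, 1.0)
            for start in range(len(words)):
                stop = min(start + max_len, len(words))
                for end in range(start + 1, stop + 1):
                    node = index[tuple(words[start:end])]
                    node["docs"].add(doc_id)
                    # Count each sentence (and its weight) once per phrase.
                    if (doc_id, sent_no) not in node["sentences"]:
                        node["sentences"].add((doc_id, sent_no))
                        node["weight"] += weight
    return index


def base_clusters(index, min_docs=2):
    """Rank phrases shared by at least min_docs documents.

    The score below is an assumed formula, chosen only to show how document
    count, sentence count, phrase length and structure weight can be combined.
    """
    scored = []
    for phrase, node in index.items():
        if len(node["docs"]) >= min_docs:
            avg_weight = node["weight"] / len(node["sentences"])
            score = len(node["docs"]) * len(phrase) * avg_weight
            scored.append((score, phrase, frozenset(node["docs"])))
    # Highest score first; ties broken alphabetically by phrase.
    return sorted(scored, key=lambda item: (-item[0], item[1]))
```

Calling `base_clusters(build_phrase_index(docs))` returns `(score, phrase, document set)` tuples, best first. A phrase that appears in a heavily weighted segment such as a title ranks above an equally frequent phrase found only in body text, which is the effect the structure weight is meant to have.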

Keywords: suffix tree; suffix tree clustering; Web document clustering; Web document structure; weight computation

Classification code: TP397.2 [Automation and Computer Technology: Computer Application Technology]

 
