Research of Improved Tree Path Model in Web Page Clustering (一种改进的树路径模型在网页聚类中的研究)  Cited by: 1

Authors: Wang Yapu (王亚普)[1], Wang Zhijian (王志坚)[1,2], Ye Feng (叶枫)[1,2]

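Affiliations: [1] College of Computer and Information, Hohai University, Nanjing 211100; [2] College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016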

Source: Computer Science (《计算机科学》), 2015, No. 5, pp. 109-113 (5 pages)

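Funding: Supported by the Jiangsu Water Conservancy Science and Technology Project "Research on the 'Smart River' and its application in the management of the Chuhe River in Luhe" (2013025) and the Fundamental Research Funds for the Central Universities, Hohai University (2009B21614)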

Abstract: Similarity computation is the basis of text mining and a key step in information extraction. For Web pages with complex structure, current similarity measures based on the traditional tree path model are not accurate enough. The traditional tree path model ignores the order in which paths occur and compares paths by exact matching only, so it cannot describe the similarity between paths precisely when they match only partially. Starting from the structural similarity of Web pages, this paper therefore proposes an improved tree path model. The model takes into account the relationships between sibling nodes, path position, and path weight, which compensates for the traditional model's inability to express document structure and hierarchy. Experimental results show that the model improves the ability to recognize structural similarity between Web pages: it separates pages with large structural differences well, reflects the differences between pages generated from the same template, and yields better results in Web page clustering.

Keywords: Information extraction; Web page structure; Similarity; Tree path model; Clustering

Classification: TP311.5 [Automation and Computer Technology: Computer Software and Theory]
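
The contrast drawn in the abstract, between exact, order-free path matching and a measure that is sensitive to path order and partial matches, can be illustrated with a minimal sketch. It is not the authors' model: the abstract gives no formulas, so `path_similarity` (shared-prefix ratio) and `partial_similarity` (position-by-position alignment) are hypothetical stand-ins, and sibling relations and path weights are not modeled at all. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Record the root-to-node tag path of every element, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        self.paths.append(tuple(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def exact_similarity(paths_a, paths_b):
    """Baseline: Jaccard overlap of path sets (order ignored, exact match only)."""
    a, b = set(paths_a), set(paths_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def path_similarity(p, q):
    """Hypothetical partial match: shared prefix length over the longer path."""
    common = 0
    for x, y in zip(p, q):
        if x != y:
            break
        common += 1
    return common / max(len(p), len(q))


def partial_similarity(paths_a, paths_b):
    """Hypothetical order-aware score: align paths by position, average the matches."""
    if not paths_a and not paths_b:
        return 1.0
    longest = max(len(paths_a), len(paths_b))
    total = sum(path_similarity(p, q) for p, q in zip(paths_a, paths_b))
    return total / longest


if __name__ == "__main__":
    page_a = "<html><body><div><p>a</p><span>b</span></div></body></html>"
    page_b = "<html><body><div><span>b</span><p>a</p></div></body></html>"
    pa, pb = tag_paths(page_a), tag_paths(page_b)
    print(exact_similarity(pa, pb), partial_similarity(pa, pb))
```

The two sample pages contain the same set of paths in a different sibling order, so the set-based baseline cannot tell them apart while the position-aligned score can. A real implementation of the paper's model would replace both hypothetical functions with the definitions given in the full text.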

 
