Centroid-Based Focused Crawler with Incremental Ability  Cited by: 4


Authors: Wang Hui (王辉)[1], Zuo Wanli (左万利)[2], Wang Huiyu (王晖昱)[3], Ning Aijun (宁爱军)[1], Sun Zhiwei (孙志伟)[1], Man Chunlei (满春雷)[1]

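Affiliations: [1] College of Computer Science and Information Engineering, Tianjin University of Science and Technology, Tianjin 300222; [2] College of Computer Science and Technology, Jilin University, Changchun 130012; [3] Faculty of Informatics, University of Wollongong, Wollongong, NSW 2500, Australia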

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2009, No. 2, pp. 217-224 (8 pages)

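Funding: Tianjin University of Science and Technology Research Startup Fund for Introduced Talent (20080418); Tianjin Higher Education Science and Technology Development Plan Fund (20071303); Jilin Province Science and Technology Development Plan Fund (20070533)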

Abstract: This paper studies how to crawl selectively within a Web page. Document feature weights and centroid feature weights are calculated using the proposed TFIDF-2 model and three heuristic rules, Max, Ave, and Sum. From these two sets of weights, a centroid vector corresponding to the root set of documents is constructed and used as a front-end classifier to guide a focused crawler. The front-end and back-end classifiers each score the anchor text of every URL in the frontier, and the two scores for the same URL are summed. The crawler then selects the URL with the highest combined anchor text score from the frontier and downloads the corresponding Web page. Four series of experiments are conducted. The results show that, guided by the centroid vector, the focused crawler can efficiently and accurately predict the relevance of the Web page a link points to using only the link's anchor text. Furthermore, the two-classifier framework gives the crawling strategy an incremental crawling ability, which is one of the most important features of focused crawling and a problem the field needs to address.
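The selection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define TFIDF-2, so plain TF-IDF stands in for it; the reading of Max, Ave, and Sum as ways of aggregating per-document term weights into a centroid weight is an assumption; and the back-end classifier `keyword_backend_score`, the function names, and the example URLs are all hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def tfidf_vectors(docs):
    """Document feature weights. Plain TF-IDF is used here as a stand-in
    for the paper's TFIDF-2 model, which the abstract does not define."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors

def build_centroid(doc_vectors, rule="ave"):
    """Centroid feature weights. Assumption: Max, Ave, and Sum aggregate a
    term's document weights across the root set."""
    terms = {t for v in doc_vectors for t in v}
    centroid = {}
    for t in terms:
        weights = [v.get(t, 0.0) for v in doc_vectors]
        if rule == "max":
            centroid[t] = max(weights)
        elif rule == "sum":
            centroid[t] = sum(weights)
        elif rule == "ave":
            centroid[t] = sum(weights) / len(weights)
        else:
            raise ValueError(f"unknown rule: {rule}")
    return centroid

def centroid_score(anchor_text, centroid):
    """Front-end score: cosine similarity between anchor text and centroid."""
    tf = Counter(tokenize(anchor_text))
    dot = sum(c * centroid.get(t, 0.0) for t, c in tf.items())
    norm_a = math.sqrt(sum(c * c for c in tf.values()))
    norm_c = math.sqrt(sum(w * w for w in centroid.values()))
    if norm_a == 0.0 or norm_c == 0.0:
        return 0.0
    return dot / (norm_a * norm_c)

def select_next(frontier, front_score, back_score):
    """Pick the URL whose anchor text has the highest summed score.
    `frontier` maps URL -> anchor text."""
    return max(frontier,
               key=lambda url: front_score(frontier[url]) + back_score(frontier[url]))

# Hypothetical back-end classifier: fraction of anchor tokens in a keyword set.
def keyword_backend_score(anchor_text, keywords=frozenset({"crawler", "crawling"})):
    toks = tokenize(anchor_text)
    return sum(t in keywords for t in toks) / len(toks) if toks else 0.0

root_set = ["focused crawler topic crawling", "web crawler anchor text"]
centroid = build_centroid(tfidf_vectors(root_set), rule="ave")
frontier = {
    "http://a.example/crawl": "focused crawler tutorial",
    "http://b.example/food": "pizza recipes",
}
next_url = select_next(frontier,
                       lambda text: centroid_score(text, centroid),
                       keyword_backend_score)
```

Keeping the two scorers as separate callables mirrors the two-classifier framing in the abstract: either one could be retrained or replaced while the crawl continues, which is one plausible reading of how the framework supports incremental crawling.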

Keywords: document feature weight; centroid feature weight; focused crawling; anchor text; centroid vector

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]

 
