A Novel Crawling Strategy of Focused Web Crawler  Cited by: 15

Authors: Song Haiyang (宋海洋)[1], Liu Xiaoran (刘晓然)[1], Qian Haijun (钱海俊)[1]

Affiliation: [1] Department of Information Warfare Research, Naval Command College, Nanjing, Jiangsu 211800

Source: Computer Applications and Software (《计算机应用与软件》), 2011, No. 11, pp. 264-267, 293 (5 pages)

Abstract: To address the low accuracy or slow crawling speed of traditional focused web crawlers, a new crawling strategy is proposed that mainly improves the "second crawl" process. An "experience tree" is added to the traditional focused crawler workflow. It combines two relevance analysis algorithms, one based on content analysis and the other on link analysis, and it saves the "experience" gained during crawling to guide subsequent crawls. Experimental results show that a focused crawler implementing the improved strategy performs considerably better than traditional ones.

Keywords: focused web crawler; crawling strategy; second crawl; relevance analysis

Classification No.: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
