Research and Optimization of Web Page Update Prediction in Nutch  Cited by: 1

Authors: HU Wei (胡伟)[1], WU Haitao (吴海涛)[1]

Affiliation: [1] College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China

Source: Journal of Shanghai Normal University (Natural Sciences), 2016, No. 4, pp. 448-457 (10 pages)

Abstract: Nutch predicts web page updates with a neighbor-comparison method whose update parameters must be set manually. Because these parameters cannot adapt on their own, the method copes poorly with the wide variation in update behavior across massive numbers of web pages. To address this problem, this paper proposes a dynamic selection strategy that improves Nutch's web page update prediction. When a page has too little update history, the strategy applies a MapReduce-based DBSCAN clustering algorithm to reduce the number of pages the crawler has to fetch, and uses the update cycle of the sampled pages as the update cycle of the other pages in the same cluster. When a page has enough update history, the strategy models that history as a Poisson process, which predicts the update cycle of each page more accurately. Finally, the improved strategy is tested on a Hadoop distributed platform. The experimental results show that the optimized web page update prediction method performs better than the original method.
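The dynamic selection strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Visit`, `poisson_update_period`, `predict_update_period`, and the `min_visits` threshold are hypothetical, the cluster-level update cycle is assumed to have been computed elsewhere (for example by the DBSCAN step), and the Poisson rate is estimated with the simple ratio of observed changes to observed time, which undercounts when a page changes more than once between two fetches.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Visit:
    """One crawl of a page (hypothetical record layout)."""
    interval_days: float  # time elapsed since the previous fetch
    changed: bool         # True if the content differed from the previous fetch


def poisson_update_period(history: Sequence[Visit]) -> Optional[float]:
    """Estimate the mean time between updates as 1 / lambda.

    lambda is approximated as (observed changes) / (total observed time).
    Returns None when no rate can be estimated.
    """
    total_time = sum(v.interval_days for v in history)
    changes = sum(1 for v in history if v.changed)
    if total_time <= 0 or changes == 0:
        return None
    return total_time / changes


def predict_update_period(
    history: Sequence[Visit],
    cluster_period_days: float,
    min_visits: int = 10,  # assumed threshold for "enough history"
) -> float:
    """Choose between the per-page Poisson estimate and the cluster-level cycle."""
    if len(history) >= min_visits:
        estimate = poisson_update_period(history)
        if estimate is not None:
            return estimate
    # Too little history (or no observed change): fall back to the update
    # cycle of the sampled page that represents this page's cluster.
    return cluster_period_days


# Ten fetches two days apart, with a change seen on every other fetch.
long_history = [Visit(interval_days=2.0, changed=(i % 2 == 0)) for i in range(10)]
short_history = long_history[:3]

print(predict_update_period(long_history, cluster_period_days=7.0))
print(predict_update_period(short_history, cluster_period_days=7.0))
```

With enough history the function returns the per-page Poisson estimate. Otherwise it returns the cluster-level cycle, which mirrors the two branches of the strategy in the abstract.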

Keywords: Nutch; web page update prediction; density-based clustering algorithm; Poisson process; distributed programming

CLC number: TP311.52 [Automation and Computer Technology - Computer Software and Theory]

 
