Affiliation: [1] College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234
Source: Journal of Shanghai Normal University (Natural Sciences), 2016, No. 4, pp. 448-457 (10 pages)
Abstract: Nutch predicts web page updates with an adjacent-comparison method whose update parameters must be set manually and cannot adapt, so it cannot cope with the differences in update behavior across a massive number of web pages. To address this problem, this paper proposes a dynamic selection strategy that improves Nutch's web page update prediction. When a page's update history is insufficient, the strategy uses a MapReduce-based DBSCAN clustering algorithm to reduce the number of pages the crawler has to fetch, and takes the update cycle of a sample page as the update cycle of the other pages in the same cluster. When the update history is sufficient, the strategy models that history as a Poisson process to predict each page's update cycle more accurately. Finally, the improved strategy is tested on a Hadoop distributed platform. The experimental results show that the optimized web page update prediction method performs better.
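Keywords: Nutch; web page update prediction; density-based clustering algorithm; Poisson process; distributed programming
Classification number: TP311.52 [Automation and Computer Technology: Computer Software and Theory]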
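
The dynamic selection described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no formulas or thresholds, so the history threshold `MIN_HISTORY`, the function names, and the smoothed Poisson rate estimator used here are all assumptions made for illustration. The sketch assumes a page is revisited at a fixed interval and that each visit records only whether a change was detected; the cluster-level cycle is assumed to come from a separate clustering step (DBSCAN in the paper) that is not shown.

```python
import math
from typing import Sequence

# Hypothetical threshold: the abstract does not say how much history is "enough".
MIN_HISTORY = 10


def poisson_update_cycle(changed: Sequence[bool], check_interval: float) -> float:
    """Estimate the mean time between updates, assuming updates follow a
    Poisson process and the page was revisited every `check_interval`.

    Under that assumption, the probability that a visit finds no change is
    exp(-rate * check_interval), so the rate can be recovered from the
    fraction of unchanged visits. The +0.5 terms are a smoothing choice
    (an assumption here) that keeps the estimate finite when every visit
    detected a change.
    """
    n = len(changed)
    if n == 0:
        raise ValueError("history is empty")
    x = sum(changed)  # number of visits on which a change was detected
    if x == 0:
        return math.inf  # no change ever observed: no finite estimate
    unchanged_fraction = (n - x + 0.5) / (n + 0.5)
    rate = -math.log(unchanged_fraction) / check_interval
    return 1.0 / rate


def select_update_cycle(
    changed: Sequence[bool], check_interval: float, cluster_cycle: float
) -> float:
    """Pick the cluster sample's cycle when history is short, else the
    per-page Poisson estimate."""
    if len(changed) < MIN_HISTORY:
        return cluster_cycle
    return poisson_update_cycle(changed, check_interval)
```

For example, a page with only two recorded visits falls back to its cluster's cycle, while a page with ten or more visits gets its own estimate. Both cycles are expressed in the same time unit as `check_interval`.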