A Three Layer Distributed Architecture for Large-Scale Duplicated Web Page Detection

Cited by: 2

Source: Computer & Digital Engineering (《计算机与数字工程》), 2015, No. 10, pp. 1751-1755 (5 pages)

Abstract: Removing duplicated web pages is a necessary step in web crawling. Current research on duplicated web page detection focuses on the accuracy and time complexity of the detection algorithms themselves, which are based on the similarity of web page content. This paper proposes a three-layer distributed architecture for large-scale duplicated web page detection, which detects duplicated pages efficiently by combining local memory caches, distributed caches, and a distributed index. The architecture is especially suitable for applications in which page content is updated frequently and pages must be crawled repeatedly. The experimental results indicate that the proposed architecture can satisfy the requirements of large-scale duplicated web page detection in a distributed web crawler environment, and that it scales well.
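The abstract names the three layers but not how a lookup moves between them. The following is a minimal sketch of one plausible lookup order (local cache, then distributed cache, then distributed index), not the authors' implementation. Everything in it is an assumption made for illustration: the class names `LRUCache` and `ThreeLayerDeduplicator`, the SHA-1 content fingerprint, the promotion of hits into the faster layers, and the use of plain in-memory Python sets as stand-ins for the distributed cache and the distributed index.

```python
import hashlib
from collections import OrderedDict


class LRUCache:
    """Bounded in-process set of fingerprints (hypothetical local cache layer)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def contains(self, key):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return True
        return False

    def add(self, key):
        self._items[key] = True
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


class ThreeLayerDeduplicator:
    """Checks layers from cheapest to most expensive (assumed order)."""

    def __init__(self, shared_cache, shared_index, local_capacity=1024):
        self.local = LRUCache(local_capacity)
        # Assumption: any object supporting `in` and `.add()`; here a set
        # stands in for a real distributed cache and distributed index.
        self.shared_cache = shared_cache
        self.shared_index = shared_index

    @staticmethod
    def fingerprint(content):
        # Assumption: exact-content hash; the paper may use a similarity
        # based fingerprint instead.
        return hashlib.sha1(content.encode("utf-8")).hexdigest()

    def is_duplicate(self, content):
        fp = self.fingerprint(content)
        if self.local.contains(fp):          # layer 1: local memory cache
            return True
        if fp in self.shared_cache:          # layer 2: distributed cache
            self.local.add(fp)
            return True
        if fp in self.shared_index:          # layer 3: distributed index
            self.shared_cache.add(fp)
            self.local.add(fp)
            return True
        # Unseen page: record it in every layer so later checks hit early.
        self.shared_index.add(fp)
        self.shared_cache.add(fp)
        self.local.add(fp)
        return False


# Two crawler nodes sharing the same (simulated) distributed layers.
cache, index = set(), set()
node_a = ThreeLayerDeduplicator(cache, index)
node_b = ThreeLayerDeduplicator(cache, index)
first_seen = node_a.is_duplicate("<html>page one</html>")
seen_again = node_a.is_duplicate("<html>page one</html>")
seen_by_other_node = node_b.is_duplicate("<html>page one</html>")
```

In this sketch a page first seen by one node is reported as a duplicate by the other node through the shared layers, while repeated checks on the same node are answered from its local cache without touching the shared layers.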

Keywords: duplicated web page detection; web crawler; distributed architecture

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]
