Design and Optimisation of an MD5 Duplicate Elimination Tree-Based Network Crawler (基于MD5去重树的网络爬虫的设计与优化)

Cited by: 10

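Affiliations: [1] School of Information and Electrical Engineering, Xuzhou Institute of Technology, Xuzhou 221008, Jiangsu, China; [2] Xuzhou Overseas Science and Technology Talent Entrepreneurship Base, Xuzhou 221000, Jiangsu, China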

Source: Computer Applications and Software (《计算机应用与软件》), 2015, No. 2, pp. 325-329, 333 (6 pages)

Funding: Xuzhou Science and Technology Plan Project (XF12C048)

Abstract: As the information society develops, the volume of data on the Internet keeps growing, and a variety of search engines have emerged in response; web crawlers supply the data on which these search engines are built. When the data volume is large, most ordinary web crawlers spend a great deal of time on DNS resolution and URL deduplication. To improve crawler efficiency and keep performance good when processing big data, this paper caches DNS results in a hash linked list (hash chain), which makes DNS resolution 2.5 to 3 times more efficient than in an ordinary crawler without DNS optimisation. It then combines the MD5 hash algorithm with a tree structure to design an MD5-based URL duplicate elimination tree. In theory, this reduces the space complexity of URL deduplication by a factor of 60 compared with an ordinary hash table, while the time complexity of a duplicate check approaches O(1). Experiments show that the designed data structure performs fairly well.

Keywords: search engine; web crawler; hash linked list; duplicate elimination tree

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
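
The abstract names two data structures without giving their layout, so the following is only a minimal sketch of how they might look, not the authors' implementation. It assumes that the "hash linked list" is a hash table with chained buckets, and that the duplicate elimination tree is a trie indexed by the 32 hexadecimal digits of a URL's MD5 digest. The class names `DnsCache` and `Md5DedupTree`, the `bucket_count` parameter, and the injectable `resolver` are hypothetical and chosen for illustration.

```python
import hashlib
import socket


class DnsCache:
    """Hostname -> IP cache: a hash table whose buckets are chains (assumed layout)."""

    def __init__(self, bucket_count=1024, resolver=socket.gethostbyname):
        self._buckets = [[] for _ in range(bucket_count)]
        self._resolver = resolver  # injectable so the cache can be exercised offline

    def resolve(self, host):
        bucket = self._buckets[hash(host) % len(self._buckets)]
        for cached_host, ip in bucket:
            if cached_host == host:
                return ip  # cache hit: no DNS round trip
        ip = self._resolver(host)  # cache miss: resolve once, then remember
        bucket.append((host, ip))
        return ip


class Md5DedupTree:
    """Trie over the 32 hex digits of each URL's MD5 digest (assumed layout)."""

    def __init__(self):
        self._root = {}

    def add(self, url):
        """Insert url; return True if it was new, False if its digest was already present."""
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        node = self._root
        is_new = False
        for ch in digest:
            if ch not in node:
                node[ch] = {}
                is_new = True  # a fresh branch means this digest was not stored before
            node = node[ch]
        return is_new
```

Because every digest has the same fixed length, a duplicate check walks exactly 32 levels however many URLs are stored, which is one way to read the abstract's claim that checking approaches O(1). This sketch makes no attempt to reproduce the reported 60-fold space saving, which depends on node-layout details the abstract does not state.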
