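Authors: Yan Lei (严磊) [1], Ding Bin (丁宾) [1], Yao Zhimin (姚志敏), Ma Yongnan (马勇男) [1], Zheng Tao (郑涛) [1]
Affiliations: [1] School of Information and Electrical Engineering, Xuzhou Institute of Technology (徐州工程学院信电工程学院), Xuzhou 221008, Jiangsu; [2] Xuzhou Overseas Sci-Tech Talent Entrepreneurship Base (徐州海外科技人才创业基地), Xuzhou 221000, Jiangsu
Source: Computer Applications and Software (《计算机应用与软件》), 2015, No. 2, pp. 325-329, 333 (6 pages)
Funding: Xuzhou Science and Technology Plan Project (XF12C048)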
Abstract: As the information society develops, the volume of data on the Internet keeps growing, and a wide variety of search engines has emerged; web crawlers supply the data on which those search engines are built. Once the data volume becomes very large, most ordinary web crawlers spend a large share of their time on DNS resolution and URL deduplication. To improve crawler efficiency and keep performance good at large data scale, the authors cache DNS results in a hash linked list, which makes DNS resolution 2.5 to 3 times faster than in an ordinary crawler without DNS optimisation. They then combine the MD5 hash algorithm with a tree to design an MD5-based URL deduplication tree. In theory, this reduces the space complexity of URL deduplication by a factor of 60 relative to an ordinary hash table, while keeping the time complexity of a duplicate check close to O(1). Experiments confirm that the proposed data structures perform well.
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
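The two data structures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the layout of either structure, so the sketch assumes that the "hash linked list" is a hash table with separate chaining and that the deduplication tree is a trie keyed on the 32 hexadecimal digits of each URL's MD5 digest. The names `DnsCache`, `Md5UrlTree`, `resolve`, and `add` are hypothetical, and the sketch makes no attempt to reproduce the reported 2.5 to 3 times speed-up or the factor-of-60 space reduction.

```python
import hashlib
import socket


class DnsCache:
    """Hypothetical DNS cache: a hash table with separate chaining (assumed layout)."""

    def __init__(self, buckets=1024, resolver=socket.gethostbyname):
        self._buckets = [[] for _ in range(buckets)]
        self._resolver = resolver  # injectable so the sketch can run offline
        self.misses = 0            # number of real lookups performed

    def resolve(self, host):
        """Return the cached IP for host, resolving it only on the first request."""
        chain = self._buckets[hash(host) % len(self._buckets)]
        for cached_host, ip in chain:
            if cached_host == host:
                return ip
        self.misses += 1
        ip = self._resolver(host)
        chain.append((host, ip))
        return ip


class Md5UrlTree:
    """Hypothetical dedup tree: a trie over the hex digits of the URL's MD5 digest."""

    def __init__(self):
        self._root = {}

    def add(self, url):
        """Insert url; return True if it was new, False if it was already present."""
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        node = self._root
        is_new = False
        for ch in digest:  # always exactly 32 steps, independent of how many URLs are stored
            if ch not in node:
                node[ch] = {}
                is_new = True
            node = node[ch]
        return is_new


# Usage with a stub resolver (192.0.2.1 is a documentation-only address).
cache = DnsCache(buckets=8, resolver=lambda host: "192.0.2.1")
cache.resolve("example.com")
cache.resolve("example.com")
print(cache.misses)                      # 1: the second call is served from the cache

seen = Md5UrlTree()
print(seen.add("http://example.com/a"))  # True: first time this URL is seen
print(seen.add("http://example.com/a"))  # False: duplicate
```

Because every MD5 digest has the same length, a duplicate check walks a fixed number of tree levels regardless of how many URLs are stored, which is one way to read the abstract's claim of a lookup cost close to O(1).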