Design and Research of a Distributed Crawler Scheduling Strategy Based on Double Buffering  Cited by: 4

Design and Research of Distributed Reptile Scheduling Strategy Based on Double Buffer

Authors: LU Zhao; SHI Jun[2]; ZHANG Yaowu; WANG Qi (School of Mathematics and Information Technology, Yuncheng University, Yuncheng 044000; School of Computer Science, Shaanxi Normal University, Xi'an 710100)

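Affiliations: [1] School of Mathematics and Information Technology, Yuncheng University, Yuncheng 044000 [2] School of Computer Science, Shaanxi Normal University, Xi'an 710100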

Source: Computer & Digital Engineering, 2022, No. 8, pp. 1686-1690 (5 pages)

Funding: Supported by the Yuncheng University Applied Research Project (Nos. XK-2018039/CY-2019038).

Abstract: The rapid growth of the Internet has made big data applications increasingly widespread and distributed crawlers increasingly important. Mainstream open-source crawler frameworks currently do little to optimize network communication overhead and lack an effective way to reduce it. Exploiting the fact that, in a peer-to-peer architecture, each crawler is both a consumer and a producer of tasks, the paper proposes executing tasks locally wherever possible. A dynamic load-balancing strategy for coarse-grained tasks, implemented with double buffering, effectively reduces communication frequency; a URL deduplication scheme based on the caching principle trades space for time to improve URL deduplication performance. Experimental results show that the strategy has good scalability and robustness and allows the performance advantages of a distributed system to be more fully realized.
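The two ideas in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation, which this record does not describe. All names here (`DoubleBufferedTasks`, `CachedDedup`, `fetch_batch`, `shared_seen`) are hypothetical. A plain Python `set` and a callable stand in for the shared store (for example Redis in a Scrapy-Redis deployment), and the standby buffer is refilled synchronously, whereas a real system would presumably prefetch in the background.

```python
from collections import OrderedDict, deque


class DoubleBufferedTasks:
    """Two local buffers: one is consumed while the other collects the next batch."""

    def __init__(self, fetch_batch, batch_size=4):
        self.fetch_batch = fetch_batch  # hypothetical callable: n -> list of URLs
        self.batch_size = batch_size
        self.active = deque()           # buffer currently being consumed
        self.standby = deque()          # buffer collecting upcoming work
        self.remote_calls = 0           # round trips to the shared queue

    def add_local(self, url):
        """Keep newly discovered work on this node instead of sending it out."""
        self.standby.append(url)

    def next_task(self):
        if not self.active:
            if not self.standby:
                # One coarse-grained request fetches a whole batch of tasks.
                self.remote_calls += 1
                self.standby.extend(self.fetch_batch(self.batch_size))
            # Swap roles: the filled standby buffer becomes the active one.
            self.active, self.standby = self.standby, self.active
        return self.active.popleft() if self.active else None


class CachedDedup:
    """Local LRU cache in front of a shared 'seen' set (space traded for time)."""

    def __init__(self, shared_seen, capacity=1000):
        self.shared_seen = shared_seen  # stand-in for a shared set of seen URLs
        self.cache = OrderedDict()      # recently seen URLs, oldest first
        self.capacity = capacity
        self.shared_lookups = 0         # lookups that reached the shared set

    def is_new(self, url):
        if url in self.cache:
            self.cache.move_to_end(url)  # cache hit: no shared lookup needed
            return False
        self.shared_lookups += 1
        new = url not in self.shared_seen
        self.shared_seen.add(url)
        self.cache[url] = True
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used URL
        return new


# Usage: a deque stands in for the shared task queue.
remote = deque(f"u{i}" for i in range(10))


def fetch_batch(n):
    return [remote.popleft() for _ in range(min(n, len(remote)))]


tasks = DoubleBufferedTasks(fetch_batch, batch_size=4)
dedup = CachedDedup(shared_seen=set(), capacity=100)
crawled = []
while (url := tasks.next_task()) is not None:
    if dedup.is_new(url):
        crawled.append(url)
```

In this sketch, batching means the number of remote requests grows with the number of batches rather than the number of tasks, and repeated URLs that are still in the local cache never reach the shared set.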

Keywords: distributed crawler; dynamic load balancing; Scrapy-Redis; double-buffering mechanism

Classification: TP301.6 [Automation and Computer Technology: Computer System Architecture]

 
