Study on Active Acquisition of Distributed Web Crawler Cluster (主动获取式的分布式网络爬虫集群方法研究)

Cited by: 14

Authors: DONG Yu-long (董禹龙), YANG Lian-he (杨连贺) [1], MA Xin (马欣) [1]

Affiliation: [1] School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin 300387, China

Source: Computer Science (《计算机科学》), 2018, Issue B06, pp. 428-432 (5 pages)

Abstract: To address the problems of processing efficiency, scalability, reliability, task allocation, and load balancing in current distributed web crawler methods, this paper proposes a distributed web crawler method based on active task acquisition. The method adds a sub-control module to each worker node; the module evaluates the node's load and running status and actively requests a task queue from the central control node. Building on this, and combined with a dynamic bidirectional priority task allocation algorithm, the paper designs a distributed web crawler model featuring load balancing, tiered task allocation, rapid detection of abnormal nodes, and safe node exit. Practical tests show that the method can effectively use general-purpose platforms to build large distributed crawler clusters.
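The pull-based idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the class names, the capacity-based load measure, the 0.5 threshold, and the single-level priority heap are all assumptions made for illustration, and the paper's dynamic bidirectional priority algorithm is not reproduced here.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                      # lower value = more urgent (assumption)
    url: str = field(compare=False)    # excluded from ordering


class CentralNode:
    """Holds the global priority queue; it never pushes work on its own."""

    def __init__(self, tasks):
        self._heap = list(tasks)
        heapq.heapify(self._heap)

    def request_tasks(self, max_count):
        """Hand out up to max_count of the most urgent pending tasks."""
        batch = []
        while self._heap and len(batch) < max_count:
            batch.append(heapq.heappop(self._heap))
        return batch

    def pending(self):
        return len(self._heap)


class WorkerNode:
    """Worker with a hypothetical sub-control step that sizes its own requests."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity       # max tasks this node holds at once
        self.queue = []

    def load(self):
        """Fraction of capacity in use; a stand-in for real CPU/memory metrics."""
        return len(self.queue) / self.capacity

    def pull(self, central, threshold=0.5):
        """Request work only when local load is below the threshold."""
        if self.load() >= threshold:
            return 0
        batch = central.request_tasks(self.capacity - len(self.queue))
        self.queue.extend(batch)
        return len(batch)


def demo():
    central = CentralNode(
        Task(p, f"https://example.com/page{p}") for p in range(10)
    )
    fast = WorkerNode("fast", capacity=6)
    slow = WorkerNode("slow", capacity=2)
    pulled = [fast.pull(central), slow.pull(central)]
    # A fully loaded node declines to pull until it drains part of its queue.
    pulled.append(fast.pull(central))
    return pulled, central.pending(), [t.priority for t in slow.queue]
```

Because each worker sizes its own request, a node with more spare capacity takes a larger batch without the central node having to track per-node load, which is one plausible reading of how active acquisition supports load balancing.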

Keywords: active acquisition; distributed crawler; load balancing; crawler framework; multi-process; dynamic priority

Classification: TP301.6 [Automation and Computer Technology: Computer System Architecture]
