检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:董禹龙 杨连贺[1] 马欣[1] DPNG Yu -long ,YANG Lian -he ,MA Xin(School of Computer Science and Software Engineering,Tianjin Polytechnic University,Tianjin 300387, Chin)
机构地区:[1]天津工业大学计算机科学与软件学院,天津300387
出 处:《计算机科学》2018年第B06期428-432,共5页Computer Science
摘 要:针对当前分布式网络爬虫方法遇到的处理效率、扩展性、可靠性、任务分配和负载平衡等问题,提出了一种主动获取任务式的分布式网络爬虫方法。该方法在子机节点中加入分控模块,评估节点负载及运行状况,并主动向中控节点申请任务队列。在此基础上,结合动态双向优先级任务分配算法,设计了一种具有负载平衡、任务分级分配、节点异常敏捷识别、节点安全退出等特性的分布式网络爬虫模型。实际测试表明,该主动获取式的分布式网络爬虫方法可有效地利用通用平台建立大型分布式爬虫集群。In this paper,in order to solve the processing efficiency,scalability,task allocation and load balance problem existed in the present distributed web crawler method,an active acquisition task distributed web crawler method was proposed,in which a sub-controlled module is added into the sub-node to evaluate the node load and operation status,and apply task queue for the central control node.Based on this method as well as the dynamic dual-directional priority task allocation algorithm,a distributed network crawler model was designed,which has the characteristics of load balance,task hierarchical allocation,abnormal node smart identification and safe exit,etc.The practice test shows that the active acquisition task distributed web crawler method can be used to build large-scale distributed crawler cluster effectively.
关 键 词:主动获取 分布式爬虫 负载平衡 爬虫框架 多进程 动态优先级
分 类 号:TP301.6[自动化与计算机技术—计算机系统结构]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.53