Simulation of Distributed Web Crawler Data Collection Algorithm under Scrapy Framework

Cited by: 9


Authors: LIU Duo-lin (刘多林) [1]; LV Miao (吕苗) (College of Economy and Management, Shenyang Ligong University, Shenyang, Liaoning 110159, China)

Affiliation: [1] Shenyang Ligong University, Shenyang, Liaoning 110159, China

Source: Computer Simulation (《计算机仿真》), 2023, Issue 6, pp. 504-508 (5 pages)

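Funding: 2021 Basic Scientific Research Project for Higher Education Institutions, Education Department of Liaoning Province (General Key Project) (LJKR0114).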

Abstract: To speed up data collection and avoid repeated collection, this paper proposes a distributed web crawler data collection algorithm under the Scrapy framework. The Scrapy framework is built from modules such as the search engine, scheduler, downloader, and data parser, and the crawler system is defined as comprising two parts: distributed computing and distributed storage. To keep the crawling load balanced, crawl speed is taken as the evaluation index for computing node weights. An ant colony optimization algorithm with a pseudo-random rule is used to obtain each agent's web page transition probability and determine the crawl path; the pheromone concentration of every path is then updated, and the target solution is selected according to the objective function distance. The data feature vectors are analyzed together to compute the topic similarity of links, and links with higher similarity are placed in the to-be-crawled set; this yields an influence factor for the degree of overlap between data, which is used to avoid repeated collection. Crawling stops when the pheromone concentration falls to its minimum, completing the collection. Simulation results show that the proposed method achieves high crawl precision and recall and improves data collection speed.
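The abstract names its building blocks without giving formulas, so the following is only a rough Python sketch of how such pieces are commonly written, not the authors' implementation. It assumes node weights proportional to crawl speed, cosine similarity as the topic similarity measure, and the pseudo-random proportional rule and evaporation-plus-deposit update from standard Ant Colony System. All function names and the parameters `q0`, `alpha`, `beta`, `rho`, and `deposit` are hypothetical.

```python
import math
import random


def node_weights(speeds):
    """Assumed rule: a node's weight is its share of the total crawl speed."""
    total = sum(speeds.values())
    return {node: speed / total for node, speed in speeds.items()}


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def choose_next(candidates, pheromone, similarity, q0=0.9, alpha=1.0, beta=2.0, rng=random):
    """Pseudo-random proportional rule in the style of Ant Colony System.

    With probability q0 the best-scoring link is taken (exploitation);
    otherwise a link is drawn with probability proportional to its score.
    """
    scores = {c: (pheromone[c] ** alpha) * (similarity[c] ** beta) for c in candidates}
    if rng.random() < q0:
        return max(scores, key=scores.get)
    total = sum(scores.values())
    if total == 0:
        return rng.choice(list(candidates))
    threshold = rng.random() * total
    running = 0.0
    chosen = None
    for chosen, score in scores.items():
        running += score
        if threshold <= running:
            break
    return chosen


def update_pheromone(pheromone, visited, rho=0.1, deposit=1.0):
    """Evaporate pheromone on every link, then reinforce the links just crawled."""
    for link in pheromone:
        pheromone[link] *= 1.0 - rho
    for link in visited:
        pheromone[link] += deposit
    return pheromone
```

In a crawler built this way, `choose_next` would pick the next link from the to-be-crawled set, and the crawl would stop once every pheromone value had decayed below some chosen floor; that threshold is likewise an assumption, since the abstract only says crawling stops when the concentration reaches its minimum.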

Keywords: Distributed system; Web crawler; Data collection; Node weight

Classification: TP301 [Automation and Computer Technology: Computer System Architecture]

 
