Simulation of a Web Big Data Crawling Method Based on Web Crawlers (基于网络爬虫的网页大数据抓取方法仿真)  Cited by: 17

Web Crawler-Based Simulation of Large Data Grabbing Method for Web Pages


Authors: XIE Rong-rong (谢蓉蓉) [1]; XU Hui (徐慧) [2]; ZHENG Shuai-wei (郑帅位); MA Gang (马刚) [1]

Affiliations: [1] School of Computer Science, Xi'an Shiyou University, Xi'an, Shaanxi 710065, China; [2] School of Petroleum Engineering, Xi'an Shiyou University, Xi'an, Shaanxi 710065, China; [3] Information Centre, Xi'an Shiyou University, Xi'an, Shaanxi 710065, China

Source: Computer Simulation (《计算机仿真》), 2021, No. 6, pp. 439-443 (5 pages)

Abstract: To improve the efficiency of crawling big data from web pages and to reduce the large errors of traditional crawling methods, this paper proposes a web big data crawling method based on a web crawler. First, the basic workflow of a web crawler is analyzed and the key features of the big data are extracted following that workflow. A crawler-based data crawling strategy is then proposed from the feature extraction results. Once the key features have been computed, a breadth-first strategy is selected to fetch the data; the crawler dimension is obtained by reconstructing the phase space, and the correlation dimension is introduced to complete the crawling of the web big data and of its key features. Simulation results show that the proposed method achieves a higher crawl rate, takes less time, and is more robust than the other methods compared.
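The abstract names a breadth-first strategy but gives no implementation details. The sketch below is a rough illustration of generic breadth-first crawl ordering only, not the authors' method. The function `bfs_crawl`, its `max_pages` parameter, and the in-memory `DEMO_GRAPH` (a stand-in for fetching a page and extracting its links) are all hypothetical names introduced here.

```python
from collections import deque


def bfs_crawl(link_graph, seed, max_pages=100):
    """Return pages in breadth-first order, starting from `seed`.

    `link_graph` is a hypothetical stand-in for real page fetching:
    it maps each URL to the list of URLs that page links to.
    """
    order = []                # pages in the order they were crawled
    seen = {seed}             # URLs already queued, to avoid revisits
    frontier = deque([seed])  # FIFO queue gives breadth-first order
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical four-page site used only for illustration.
DEMO_GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

if __name__ == "__main__":
    print(bfs_crawl(DEMO_GRAPH, "a"))  # ['a', 'b', 'c', 'd']
```

The FIFO queue is what makes the traversal breadth-first: every page one link away from the seed is crawled before any page two links away. Replacing `popleft()` with `pop()` would turn the same loop into a depth-first crawl.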

Keywords: big data crawling; web crawler; features; phase space; correlation dimension

Classification number: TP309.2 [Automation and Computer Technology: Computer System Architecture]

 
