Web Page Data Crawling Method Based on Python Crawler Technology  Cited by: 5


Author: LIU Ping (刘萍) (Yancheng Kindergarten Teachers College, Yancheng, Jiangsu 224000, China)

Affiliation: [1] Yancheng Kindergarten Teachers College, Yancheng, Jiangsu 224000, China

Source: Information & Computer (《信息与电脑》), 2022, No. 14, pp. 169-171 (3 pages)

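Funding: 2021 Guangdong Provincial Key Scientific Research Platform for Regular Higher Education Institutions, Industry-Education Integration Innovation Platform for Higher Vocational Colleges, project "5G8K Ultra-High-Definition New Scenario Application Industry-Education Integration Innovation Platform" (Project No. 2021CJPT002).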

Abstract: Web page crawling often returns incomplete results, which lowers both crawl quality and efficiency. To address this, a web page data crawling method based on Python crawler technology is proposed. First, the key points of the nonlinear time series of web page data are taken as nodes to construct a phase-space layout of the crawler network. Second, Python crawler technology is used to crawl the target data within the partitioned phase-space layout. Finally, taking the crawl target as the data characteristic and the space partition as the basis, the crawler applies personalized marks to the pages in the library, extracts all Uniform Resource Locator (URL) information contained in each page, and compares it with the queue of already-crawled URLs to determine the completeness of the crawl results. Test results show that the method adapts to different network environments and crawls web page data quickly and effectively.
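The final step of the abstract (extract every URL on a page, then compare against the already-crawled queue) can be illustrated with a minimal sketch. This is not the author's implementation: the names `LinkExtractor`, `extract_urls`, and `completeness` are hypothetical, and measuring completeness as the fraction of extracted URLs found in the crawled queue is an assumption, since the record does not define the metric. Only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.add(absolute)


def extract_urls(html, base_url):
    """Return the set of absolute URLs linked from one page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def completeness(page_urls, crawled_urls):
    """Compare extracted URLs with the crawled queue.

    Returns (ratio, missing). The ratio definition is an assumption
    made for illustration, not a metric taken from the paper.
    """
    if not page_urls:
        return 1.0, set()
    missing = set(page_urls) - set(crawled_urls)
    return 1 - len(missing) / len(page_urls), missing


SAMPLE_HTML = """<html><body>
<a href="/a.html">A</a>
<a href="b.html#top">B</a>
<a href="https://example.com/c.html">C</a>
</body></html>"""

page_urls = extract_urls(SAMPLE_HTML, "https://example.com/index.html")
crawled = ["https://example.com/a.html", "https://example.com/c.html"]
ratio, missing = completeness(page_urls, crawled)
print(f"completeness: {ratio:.2f}, missing: {sorted(missing)}")
```

In this sample, two of the three linked pages are already in the crawled queue, so the sketch reports `https://example.com/b.html` as missing. A real crawler would feed the missing URLs back into its fetch queue.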

Keywords: Python crawler technology; web page data crawling; nonlinear time series

Classification number: TP393 [Automation and Computer Technology: Computer Application Technology]

 
