Research on the Python-based Web Crawler for News Aggregation System

Cited by: 8

Author: ZUO Wei-gang (左卫刚), Shanxi Management Vocational College, Linfen, Shanxi 041051, China

Source: Journal of Changchun Normal University (《长春师范大学学报》), 2018, No. 12, pp. 29-33 (5 pages)

Abstract: This paper develops a Python-based web crawler and reserves an API, with the aim of building a news aggregation system. The news data in such a system must be acquired by a crawler, but different websites use different page layouts. The study therefore aims to create an open-source crawler that can extract data from different page layouts; the work covers the implementation of the web crawler, the API, the crawler scheduler, and a Socket server. The crawler is written in Python and uses Beautiful Soup as its web extraction tool. Laravel serves as the web application framework, with PHP as the main back-end language hosting the CMS and the API. Using configuration files created by users, the crawler adapts to different page layouts and exports the extracted data to a JSON file or a database system.
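
The configuration-driven extraction described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the paper uses Beautiful Soup, whereas the sketch below swaps in Python's standard-library `html.parser` so that it runs without third-party packages. The configuration format (`SITE_CONFIG`), the sample page, and every class and function name are assumptions made for illustration only.

```python
import json
from html.parser import HTMLParser

# Hypothetical per-site configuration: field name -> (tag, CSS class).
# A real deployment would load one such mapping per news site from a file.
SITE_CONFIG = {
    "title": ("h1", "headline"),
    "body": ("div", "article-text"),
}

# Hypothetical sample page standing in for a fetched news article.
SAMPLE_HTML = """
<html><body>
  <h1 class="headline">Local team wins final</h1>
  <div class="article-text"><p>The match ended 2-1.</p><p>Fans celebrated.</p></div>
</body></html>
"""

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ConfigExtractor(HTMLParser):
    """Collects the text of the first element matching each configured rule."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.result = {}
        self.field = None  # field currently being captured, if any
        self.depth = 0     # nesting depth inside the captured element
        self.chunks = []   # text fragments collected for the current field

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.field is not None:
            self.depth += 1  # nested element inside the captured one
            return
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_class) in self.config.items():
            if field not in self.result and tag == want_tag and want_class in classes:
                self.field, self.depth, self.chunks = field, 1, []
                return

    def handle_endtag(self, tag):
        if self.field is None or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:  # the captured element has just closed
            self.result[self.field] = " ".join(self.chunks)
            self.field = None

    def handle_data(self, data):
        if self.field is not None and data.strip():
            self.chunks.append(data.strip())


def extract(html, config):
    """Apply one site's configuration to one page and return a record."""
    parser = ConfigExtractor(config)
    parser.feed(html)
    parser.close()
    return parser.result


def to_json(record):
    """Serialize a record; ensure_ascii=False keeps non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False, indent=2)


record = extract(SAMPLE_HTML, SITE_CONFIG)
print(to_json(record))
```

Under this design, supporting a new page layout means adding another configuration mapping rather than changing the extraction code, which is the adaptability the abstract attributes to user-created configuration files.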

Keywords: web crawler; HTML extraction; news management system; application programming interface (API)

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

