Design of a News Analysis System Based on Incremental Crawler Technology


Authors: Wang Longxiao (王龙霄); Li Jian (李健)[1]; Shen Limin (沈丽民) (Luoyang Campus of Information Engineering University of the Strategic Support Force, Luoyang 471003, China)

Affiliation: [1] Luoyang Campus, Information Engineering University of the Strategic Support Force, Luoyang 471003

Source: Modern Computer (《现代计算机》), 2023, No. 9, pp. 117-120 (4 pages)

Abstract: News websites are an important channel for obtaining information about the outside world. To collect and analyze news website content effectively, a crawler and analysis system for news websites was designed in Python. The system comprises three modules: a crawler, natural language processing, and visual interaction. The crawler is multi-threaded, built on Python's threading library, and adds an incremental crawling design. For natural language processing, the system extracts keywords and key sentences from text based on the TextRank algorithm, implemented with the third-party TextRank4zh library. Interaction is implemented with the Tornado framework. The system was applied to crawl and analyze news from the Cable News Network (CNN) as an example, and the experimental results show that it has high crawling efficiency and good robustness.
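The abstract does not give the crawler's internals, so the following is only a minimal sketch of how incremental crawling might be combined with `threading`. It assumes that "incremental" means skipping URLs fetched in earlier runs. `SeenStore`, `crawl`, and `fake_fetch` are hypothetical names introduced for illustration, and the fetch function is a stand-in that makes no network requests.

```python
import hashlib
import queue
import threading


class SeenStore:
    """Thread-safe set of URL fingerprints remembered across crawl runs."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add_if_new(self, url):
        """Remember the URL and return True only if it was not seen before."""
        fingerprint = hashlib.sha256(url.encode("utf-8")).hexdigest()
        with self._lock:
            if fingerprint in self._seen:
                return False
            self._seen.add(fingerprint)
            return True


def crawl(urls, fetch, seen, num_workers=4):
    """Fetch only URLs that are not yet in `seen`, using worker threads."""
    tasks = queue.Queue()
    for url in urls:
        if seen.add_if_new(url):  # incremental step: skip known URLs
            tasks.put(url)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            try:
                page = fetch(url)
            except Exception:
                page = None  # assumption: record the failure and keep going
            with results_lock:
                results[url] = page

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_fetch(url):
    """Stand-in for an HTTP request (hypothetical, no network access)."""
    return "<html>" + url + "</html>"


seen = SeenStore()
first = crawl(["https://example.com/a", "https://example.com/b"], fake_fetch, seen)
second = crawl(["https://example.com/b", "https://example.com/c"], fake_fetch, seen)
```

In this sketch the second run fetches only `https://example.com/c`, because `https://example.com/b` was recorded during the first run. A real system would likely persist the fingerprints to disk or a database, and might un-mark URLs whose fetch failed so they are retried.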

Keywords: Python crawler; Tornado framework; TextRank algorithm; news keyword extraction

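Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]; G210.7 [Automation and Computer Technology: Computer Science and Technology]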
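To illustrate the TextRank idea named in the abstract and keywords, here is a simplified pure-Python sketch of keyword ranking. It is not the TextRank4zh API and not necessarily the authors' configuration: it assumes the text is already tokenized and stop words are removed, uses unweighted co-occurrence edges, and runs a fixed number of PageRank iterations. The function name `textrank_keywords` and its parameters are hypothetical.

```python
from collections import defaultdict


def textrank_keywords(words, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link each word to the words that follow it within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the PageRank update; every node starts with score 1.0.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for word in neighbors:
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


tokens = ["news", "crawler", "news", "system", "news", "analysis"]
top = textrank_keywords(tokens, window=1, top_k=1)
```

With `window=1`, "news" is adjacent to every other token in this toy input, so it receives the highest score and is the single keyword returned. Key-sentence extraction follows the same pattern, with sentences as nodes and a similarity measure as edge weights.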

 
