Automatic Extraction System of Webpage Information (全自动网页信息采集系统)

Cited by: 5

Source: Journal of Changchun University of Science and Technology (Natural Science Edition), 2015, No. 2, pp. 151-154 (4 pages)

Abstract: With the rapid development of the internet age, users have placed greater demands on search engines, webpage content, and big-data processing. Selecting the information that best meets a given requirement from the mass of information on the internet has become a new focus of attention. This paper extends Heritrix, an open-source, extensible web crawler written in Java, to crawl the webpages that users need, and studies information collection technology in depth. Heritrix's extensibility is used to implement user-specific crawling. Based on an analysis of Heritrix's workflow, module division, and source code design, Heritrix is extended to fetch webpages that carry product information. HtmlParser then parses the page content, and the key product information is extracted and stored in a database for retrieval.

Keywords: Heritrix; HtmlParser; web crawler; information extraction

Classification: TP393.02 [Automation and Computer Technology: Computer Application Technology]
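
The parse-and-store step described in the abstract (parse a fetched product page, pull out the key fields, save them to a database for later retrieval) can be illustrated with a minimal sketch. This is not the paper's implementation and does not use the Heritrix or HtmlParser APIs: it substitutes Python's standard `html.parser` and `sqlite3` modules, and the CSS class names `product-name` and `product-price`, the `products` table, and all function names are hypothetical choices made for illustration.

```python
import sqlite3
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects text from elements whose class attribute marks a product field.

    The class names below are hypothetical; a real site needs its own mapping.
    """

    FIELDS = {"product-name": "name", "product-price": "price"}

    def __init__(self):
        super().__init__()
        self.product = {}
        self._current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.FIELDS:
            self._current = self.FIELDS[css_class]

    def handle_data(self, data):
        # Keep the first non-empty text found inside a marked element.
        if self._current and data.strip():
            self.product[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


def extract_product(html):
    """Return a dict of the product fields found in one page."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.product


def store_product(conn, url, product):
    """Insert or update one product row, keyed by page URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(url TEXT PRIMARY KEY, name TEXT, price TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO products (url, name, price) VALUES (?, ?, ?)",
        (url, product.get("name"), product.get("price")),
    )
    conn.commit()


def search_products(conn, keyword):
    """Return stored rows whose name contains the keyword."""
    rows = conn.execute(
        "SELECT url, name, price FROM products WHERE name LIKE ?",
        (f"%{keyword}%",),
    )
    return rows.fetchall()


if __name__ == "__main__":
    page = (
        '<div><h1 class="product-name">Desk Lamp</h1>'
        '<span class="product-price">19.90</span></div>'
    )
    conn = sqlite3.connect(":memory:")
    store_product(conn, "http://example.com/item/1", extract_product(page))
    print(search_products(conn, "Lamp"))
```

In a crawler such as the one the abstract describes, the page HTML would come from the crawl stage rather than a literal string; the sketch covers only what happens after a page has been fetched.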
