Design and Implementation of a Deep Web Crawler

Cited by: 5

Source: Computer and Modernization (《计算机与现代化》), 2009, No. 3, pp. 31-34 (4 pages)

Abstract: With the rapid growth of the World Wide Web, the Deep Web holds more and more accessible information. This information is obtained by submitting forms on Web pages and is generated dynamically from the Deep Web's back-end databases. Traditional Web crawlers can only retrieve ordinary Surface Web pages by following hyperlinks. Because no static links point directly to Deep Web pages, most current search engines cannot discover or index them. Yet compared with the Surface Web, the information in the Deep Web is often of higher quality and more valuable. This paper proposes a method for designing a Deep Web crawler based on the HtmlUnit framework. The crawler integrates sites from several domains, analyzes their query forms, and fills them in automatically to retrieve relevant information from the back-end databases. Experiments carried out on actual Deep Web sites show that the method is effective.

Keywords: Deep Web; Web crawler; form

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]
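
The abstract describes a crawler that analyzes query forms and fills them in to reach pages generated from a back-end database. The paper's implementation uses HtmlUnit, which is not reproduced here. The following is only a minimal sketch of the same idea using Python's standard library: it extracts a form's fields from HTML and builds the GET URL that submitting the form would request. The names `FormExtractor` and `build_query_url`, the sample HTML, and the `example.com` URL are all hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormExtractor(HTMLParser):
    """Collects each <form> with its action, method, and named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif self._current is not None and tag in ("input", "select", "textarea"):
            name = attrs.get("name")
            # Skip unnamed controls and buttons; keep default values (e.g. hidden fields).
            if name and attrs.get("type") not in ("submit", "button", "reset"):
                self._current["fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def build_query_url(page_url, form, values):
    """Fill known fields with `values` and return the GET URL for the form."""
    filled = dict(form["fields"])
    # Only fill fields the form actually declares.
    filled.update({k: v for k, v in values.items() if k in filled})
    return urljoin(page_url, form["action"]) + "?" + urlencode(filled)


# Hypothetical search page, used only for illustration.
SAMPLE_HTML = """
<form action="/search" method="get">
  <input type="text" name="title">
  <select name="category"><option value="cs">CS</option></select>
  <input type="hidden" name="lang" value="en">
  <input type="submit" value="Go">
</form>
"""

extractor = FormExtractor()
extractor.feed(SAMPLE_HTML)
form = extractor.forms[0]
url = build_query_url(
    "http://example.com/books/", form, {"title": "deep web", "category": "cs"}
)
print(url)  # → http://example.com/search?title=deep+web&category=cs&lang=en
```

A real Deep Web crawler would also need to choose meaningful values for each field, handle POST forms and JavaScript-driven pages (one reason to use a headless browser library such as HtmlUnit), and parse the result pages; none of that is covered by this sketch.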
