Deep Web Data Source Crawl Based on Ranking from the Bottom (排名于后的深层Web数据源爬取)

Author: GUO Li (郭丽) [1] (Anhui Vocational College of Electrics & Information Technology, Bengbu, Anhui 233030, China)

Source: Journal of Jiujiang University: Natural Science Edition (九江学院学报（自然科学版）), 2019, No. 3, pp. 69-72 (4 pages)

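Funding: Supported by a Key Project of the Anhui Provincial Department of Education Support Program for Outstanding Young Talents in Universities (No. gxyqZD2018131) and by Provincial Key Natural Science Research Projects (No. KJ2017A665, KJ2017A666); one of the research outcomes of the 2019 Key Project of Humanities and Social Sciences Research in Anhui Universities (No. SK2019A0920)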

Abstract: In the era of big data, the vast majority of data does not reside on the surface Web, that is, the part of the Web whose pages are interconnected by hyperlinks. Instead, valuable databases usually reside in the deep Web (the hidden Web), behind query interfaces. Because many applications, such as vertical portals, need deep web data, various crawling methods aim to harvest deep web data sources at minimal (or near-minimal) cost. In practice, a data source usually returns only the top k matches for a query. This makes exhaustive data collection harder: high-ranked documents are returned many times, while low-ranked documents are unlikely to be returned at all. This paper decomposes the problem into two orthogonal sub-problems, query bias and ranking bias, and proposes a frequency-based crawling method that overcomes the ranking bias. The method issues queries whose document frequency falls within a specified range, which avoids the combined effect of search ranking and the return limit and greatly reduces the cost of crawling the low-ranked documents of a deep web data source. The method was tested extensively on a variety of data sets; compared with two existing methods, the experimental results show that the proposed method performs better.

Keywords: deep web; crawling; query selection; document frequency; return limit

Classification: TP391.3 [Automation and Computer Technology: Computer Application Technology]
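
The abstract describes the frequency-based idea only in words, so the following is a minimal sketch of it and not the paper's implementation. It simulates a data source that returns only the top k matches per query, then compares one high-frequency query with queries whose document frequency lies in a chosen range. The toy corpus, the ranking rule (lower id ranks higher), and the names `build_index`, `search`, `crawl`, and `select_by_frequency` are all assumptions made for illustration. The sketch also reads document frequencies directly from the index, whereas a real crawler would have to estimate them, for example from a sample of already retrieved documents.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index[term].add(doc_id)
    return index


def search(index, term, k):
    """Simulated data source: return at most the top-k matches.

    Assumption: a lower document id means a higher rank.
    """
    return sorted(index.get(term, ()))[:k]


def crawl(index, terms, k):
    """Issue every query and collect the distinct document ids returned."""
    seen = set()
    for term in terms:
        seen.update(search(index, term, k))
    return seen


def select_by_frequency(doc_freq, low, high):
    """Keep the terms whose document frequency lies in [low, high]."""
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


# Hypothetical toy corpus: "data" matches every document, other terms match two.
docs = {
    0: "data web crawl",
    1: "data web index",
    2: "data query crawl",
    3: "data query rank",
    4: "data limit rank",
    5: "data limit index",
}
k = 2  # return limit of the simulated source

index = build_index(docs)
# Assumption: exact frequencies are known; in practice they would be estimated.
doc_freq = {term: len(ids) for term, ids in index.items()}

# A high-frequency query overflows the return limit and only ever
# yields the k highest-ranked documents.
popular_result = crawl(index, ["data"], k)

# Queries with document frequency <= k are returned in full, so the
# ranking no longer decides which documents become visible.
bounded_terms = select_by_frequency(doc_freq, 1, k)
bounded_result = crawl(index, bounded_terms, k)

print(sorted(popular_result))
print(sorted(bounded_result))
```

In this toy setting the single popular query never reaches documents 2 to 5, while the frequency-bounded queries together cover the whole corpus. Choosing the upper bound equal to the return limit k is one possible reading of "a specified range"; the abstract does not state how the paper sets the range.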
