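Author: GUO Li (郭丽) [1] (Anhui Vocational College of Electrics & Information Technology, Bengbu, Anhui 233030, China)
Affiliation: [1] Anhui Vocational College of Electrics & Information Technology (安徽电子信息职业技术学院)
Source: Journal of Jiujiang University: Natural Science Edition (《九江学院学报(自然科学版)》), 2019, No. 3, pp. 69-72 (4 pages)
Funding: Supported by the Key Project of the Support Program for Outstanding Young Talents in Universities, Anhui Provincial Department of Education (No. gxyqZD2018131); Provincial Key Natural Science Research Projects (Nos. KJ2017A665, KJ2017A666); a research output of the 2019 Key Research Project in Humanities and Social Sciences of Anhui Universities (No. SK2019A0920)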
Abstract: In the era of big data, the vast majority of data does not come from the surface web, that is, the pages interconnected by hyperlinks that web search engines can reach. Instead, valuable databases usually reside in the deep web, also called the hidden web, behind query interfaces. Because many applications, such as vertical portals, need deep web data, various crawling methods have been developed to harvest deep web data sources at minimal (or near-minimal) cost. In practice, a data source usually returns only the top k matches for a query. This makes exhaustive data collection harder: high-ranked documents are returned many times, while low-ranked documents are unlikely to appear at all. The paper decomposes this problem into two orthogonal sub-problems, query bias and ranking bias, and proposes a frequency-based crawling method to overcome the ranking bias. The method issues queries whose document frequency lies within a specified range, which avoids the combined effect of search ranking and the return limit and makes low-ranked documents in deep web data sources much easier to harvest. The method was tested extensively on various data sets and compared with two existing methods; the experimental results show that it performs better.
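The abstract's central idea, choosing queries by document frequency so that a top-k return limit never hides low-ranked documents, can be illustrated with a toy simulation. This is a minimal sketch and not the paper's implementation: the corpus, the ranking, the helper names (`build_index`, `top_k_search`, `select_queries`, `crawl`) and the choice of the range [1, k] are all assumptions made for illustration, and document frequencies are read directly from the toy index, whereas a real crawler would have to estimate them.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index[term].add(doc_id)
    return index

def top_k_search(index, rank, term, k):
    """Toy data source: return at most k matching ids, best-ranked first."""
    matches = sorted(index.get(term, ()), key=lambda d: rank[d])
    return matches[:k]

def select_queries(doc_freq, low, high):
    """Keep the terms whose document frequency lies in [low, high]."""
    return [t for t, df in sorted(doc_freq.items()) if low <= df <= high]

def crawl(index, rank, queries, k):
    """Issue every query and collect the distinct documents returned."""
    harvested = set()
    for term in queries:
        harvested.update(top_k_search(index, rank, term, k))
    return harvested

# Hypothetical six-document corpus; a lower rank value means a higher rank.
docs = {
    1: "deep web crawl",
    2: "deep web query",
    3: "deep web rank",
    4: "deep hidden rank",
    5: "deep hidden query",
    6: "deep surface crawl",
}
rank = {doc_id: doc_id for doc_id in docs}
k = 2  # assumed return limit of the data source

index = build_index(docs)
doc_freq = {term: len(ids) for term, ids in index.items()}

# Terms matching more than k documents get truncated by the source.
popular = select_queries(doc_freq, k + 1, len(docs))
# Terms matching at most k documents are always returned in full.
bounded = select_queries(doc_freq, 1, k)

print("popular queries:", popular, "->", sorted(crawl(index, rank, popular, k)))
print("bounded queries:", bounded, "->", sorted(crawl(index, rank, bounded, k)))
```

In this toy setting the popular terms only ever surface the two best-ranked documents, however often they are issued, while the frequency-bounded terms reach all six, because no bounded query matches more documents than the source is willing to return.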
Classification number: TP391.3 [Automation and Computer Technology: Computer Application Technology]