Affiliation: [1] School of Information Science and Engineering, Hunan First Normal University, Changsha, Hunan 410205, China
Source: Computer Technology and Development (《计算机技术与发展》), 2017, No. 8, pp. 192-194, 199 (4 pages)
Funding: Hunan Provincial Education Research Fund (15C0284)
Abstract: Shark-search is an algorithm that prioritizes crawling according to link value. When used in a topical information collection system, it considers only the relevance of page text and link anchor text to the topic and ignores the organizational structure of Web pages, so it performs poorly on pages that contain many noisy links. Based on an analysis of page organizational structure, this paper proposes a Shark-search algorithm based on topical page segmentation. Building on classical Shark-search, the algorithm segments page content into blocks according to the page's layout tags and derives a link's integrated value from topic relevance at three levels: page, block, and link. It treats visited outgoing links as feedback for adjusting block relevance, which gives it a self-learning capability for statistically learning the features of blocks that are highly relevant to the topic. When topic drift occurs, it self-adjusts by giving links on parent pages with higher topic relevance more chances to be crawled. Crawling experiments show that the proposed algorithm improves the precision of topical information collection over classical Shark-search and adapts more flexibly to the actual state of Web resources.
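The abstract describes a link's integrated value as a combination of page-level, block-level, and link-level topic relevance, with visited outgoing links fed back to adjust block relevance. The record does not give the paper's formulas, so the following is only a minimal sketch under stated assumptions: the weighted sum, the weights, the moving-average feedback rule, and the names `link_value`, `update_block_relevance`, and `rank_links` are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical weights for page, block, and link relevance (assumed, not from the paper).
DEFAULT_WEIGHTS = (0.3, 0.3, 0.4)


def link_value(page_rel: float, block_rel: float, link_rel: float,
               weights: Tuple[float, float, float] = DEFAULT_WEIGHTS) -> float:
    """Combine the three relevance levels into one score (assumed weighted sum)."""
    w_page, w_block, w_link = weights
    return w_page * page_rel + w_block * block_rel + w_link * link_rel


def update_block_relevance(block_rel: float, observed_rel: float,
                           rate: float = 0.2) -> float:
    """Move a block's relevance toward the relevance observed on a page
    fetched through one of its links (assumed moving-average feedback)."""
    return block_rel + rate * (observed_rel - block_rel)


def rank_links(page_rel: float,
               blocks: List[Tuple[float, List[Tuple[str, float]]]]
               ) -> List[Tuple[str, float]]:
    """Score every link on a page and return (url, score) pairs, best first.

    `blocks` holds one (block_relevance, [(url, link_relevance), ...]) entry
    per block of the segmented page.
    """
    scored = [
        (url, link_value(page_rel, block_rel, link_rel))
        for block_rel, links in blocks
        for url, link_rel in links
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Two links with identical anchor relevance, one in a topical block
    # and one in a noisy (e.g. navigation) block.
    page = [(0.9, [("http://example.org/topic", 0.5)]),
            (0.1, [("http://example.org/ads", 0.5)])]
    for url, score in rank_links(0.8, page):
        print(f"{score:.2f}  {url}")
```

In this sketch, two links with the same anchor-text relevance receive different priorities because their enclosing blocks differ, which is the effect the abstract attributes to block-level scoring on pages with many noisy links.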
Keywords: Shark-search algorithm; Web page segmentation; Web information gathering; link value; topic drift