Research on a Candidate Link Search Strategy Introducing a Topic Link Block Factor  Cited by: 1

Published English title: Research of Searching Strategy in Candidate Link Introducing Topic Link Blocking Factor

Authors: ZHOU Xue (周雪) [1]; LIU Naiwen (刘乃文) [2]

Affiliations: [1] School of Information Science and Engineering, Shandong Normal University, Jinan 250014; [2] Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan 250014

Source: Computer & Digital Engineering (《计算机与数字工程》), 2018, No. 5, pp. 874-878 (5 pages)

Abstract: During topic-focused web crawling, the weight of each URL found in a page must be computed and the crawl queue continually refilled so that the crawl conditions stay satisfied. The key problem is how to find the links most relevant to the topic without causing "topic drift". Because a link's anchor text is short, it cannot adequately indicate how relevant the linked page is to the topic. To address this, the paper introduces a relevant-link-block weight into the Shark-search algorithm, computing each block's weight from the anchor text of the child links in that block. Comparative experiments verify that the improved algorithm better differentiates the relevance scores of links within the same page, improves the crawler's precision, and mitigates the "topic drift" problem.

Keywords: web page segmentation; Shark-search algorithm; link structure; topic link block

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
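
The abstract gives no formulas, so the following is only a minimal sketch of the general idea it describes: a link with an uninformative anchor text can borrow relevance from the other links in its block. Every name here (`cosine_similarity`, `block_weight`, `score_links`, the `block_factor` mixing coefficient) and the linear combination itself are assumptions made for illustration, not the paper's actual method. The full Shark-search algorithm also uses an inherited parent-page score and the text surrounding the anchor, both of which are omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    tf_a, tf_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def block_weight(anchor_texts, topic):
    """Assumed block weight: similarity of the block's pooled anchor texts to the topic."""
    return cosine_similarity(" ".join(anchor_texts), topic)


def score_links(blocks, topic, block_factor=0.5):
    """Return {url: score} for every link in every block.

    blocks: list of link blocks, each a list of (url, anchor_text) pairs.
    block_factor: hypothetical mixing coefficient in [0, 1]; 0 ignores the block.
    """
    scores = {}
    for block in blocks:
        weight = block_weight([anchor for _, anchor in block], topic)
        for url, anchor in block:
            anchor_score = cosine_similarity(anchor, topic)
            scores[url] = (1 - block_factor) * anchor_score + block_factor * weight
    return scores


# Two blocks from the same (made-up) page; both contain a link labeled "more".
topic = "python web crawler"
blocks = [
    [("http://example.com/a", "crawler tutorial"), ("http://example.com/b", "more")],
    [("http://example.com/c", "more"), ("http://example.com/d", "contact us")],
]
scores = score_links(blocks, topic)
for url, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {url}")
```

In this toy input the two "more" links have identical anchor text, yet the one sharing a block with "crawler tutorial" receives a nonzero score while the other stays at zero, which is the kind of within-page differentiation the abstract refers to.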

 
