Research on Topic Crawling Strategy Based on Semantic Tree and VSM  Cited by: 1


Authors: ZHANG Jin, NI Xiaojun[1]

Affiliation: [1] College of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China

Source: Computer Technology and Development, 2017, No. 11, pp. 66-70 (5 pages)

Funding: Special Research Project of the Ministry of Education (2013116)

Abstract: Topic crawlers address users' customized search needs: from ever-growing web data, they select and crawl the topic content a user cares about quickly, effectively, and accurately. Improving the accuracy of crawling specific information requires judging the topical relevance of page content, so relevance calculation is the core problem for a topic crawler. Most existing improved algorithms rely on techniques such as artificial intelligence and machine learning, which raise algorithmic complexity while yielding only limited gains. To address this, a topic crawling strategy based on a semantic tree and the vector space model (VSM) is proposed. It incorporates semantic similarity into both content relevance calculation and link ranking, and refines the topical relevance judgment by improving details of the algorithms used in the strategy. Experimental results show that a topic crawler using this strategy keeps its crawl path within highly relevant web links, effectively classifies links as relevant or irrelevant, and significantly improves crawling accuracy.
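The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea it describes: a VSM cosine score is blended with a tree-based semantic similarity, and the blended score is used to order candidate links. The toy tree `PARENT`, the path-length similarity, the weight `alpha`, and all function names are assumptions made for illustration, not the authors' method.

```python
import heapq
import math
from collections import Counter

# Hypothetical toy semantic tree (child -> parent); the root has parent None.
PARENT = {
    "entity": None,
    "sport": "entity",
    "football": "sport",
    "basketball": "sport",
    "food": "entity",
    "noodle": "food",
}


def path_to_root(term):
    """Return the list of nodes from term up to the root."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def tree_similarity(a, b):
    """Assumed measure: 1 / (1 + number of edges between a and b)."""
    if a not in PARENT or b not in PARENT:
        return 0.0
    depth_in_b = {node: i for i, node in enumerate(path_to_root(b))}
    for i, node in enumerate(path_to_root(a)):
        if node in depth_in_b:  # lowest common ancestor
            return 1.0 / (1.0 + i + depth_in_b[node])
    return 0.0


def cosine(u, v):
    """Cosine similarity of two term-frequency vectors (the VSM part)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_score(terms, topic_terms):
    """Average, over topic terms, of the best tree similarity to any text term."""
    if not terms or not topic_terms:
        return 0.0
    best = [max(tree_similarity(t, w) for w in terms) for t in topic_terms]
    return sum(best) / len(topic_terms)


def relevance(terms, topic_terms, alpha=0.5):
    """Blend VSM and semantic scores; alpha is an assumed weighting."""
    vsm = cosine(Counter(terms), Counter(topic_terms))
    return alpha * vsm + (1 - alpha) * semantic_score(terms, topic_terms)


def rank_links(links, topic_terms):
    """links: list of (url, anchor_terms). Return urls, most relevant first."""
    heap = [(-relevance(terms, topic_terms), url) for url, terms in links]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

In this sketch a link whose anchor text shares no literal term with the topic can still outrank an unrelated link if its terms sit close to the topic in the tree, which is the kind of effect the abstract attributes to adding semantic similarity to link ranking.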

Keywords: topic crawler; semantic tree; vector space model; content relevance; link ranking

Classification: TP301 [Automation and Computer Technology: Computer System Architecture]

 
