Design and implementation of focused crawler based on concept context graph (基于概念背景图的主题爬虫设计与实现)  Cited by: 5

Authors: Guan Weiguo (关卫国) [1], Luo Yongcheng (骆永成) [1]

Affiliation: [1] School of Computer Science and Technology, Donghua University, Shanghai 201620

Source: Computer Engineering and Design (《计算机工程与设计》), 2016, No. 10, pp. 2679-2684 (6 pages)

Funding: Fundamental Research Funds for the Central Universities (2232013D3)

Abstract: To make full use of the semantic relationship between the crawl topic and unvisited pages, and to improve the overall performance of a focused crawler, the concept context graph (CCG) focused crawling algorithm is studied and an improved CCG crawling algorithm is proposed. High-quality topic background knowledge is selected with the HITS algorithm, a concept lattice model is built according to formal concept analysis theory, and the lattice is used to generate a CCG that stores the user's query intention. The CCG is then used to predict the topical relevance of each link by combining the parent page, the anchor text, the link context, and the URL itself, and irrelevant pages are filtered out. Experimental results show that the improved algorithm effectively raises the precision and recall of page fetching and is highly feasible.

Keywords: focused crawler; formal concept analysis; concept lattice; concept context graph; link prediction

Classification No.: TP393 [Automation and Computer Technology: Computer Application Technology]
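
The abstract describes predicting a link's topical relevance from four sources of evidence (parent page, anchor text, link context, and the URL itself) and filtering out irrelevant pages. The record does not give the paper's formula, so what follows is only a minimal sketch of that idea under stated assumptions: the topic is represented as a plain dictionary of weighted terms standing in for the CCG, each source is scored by cosine similarity, and the four scores are combined with fixed weights. The names `WEIGHTS`, `link_relevance`, and `should_crawl`, the weight values, and the threshold are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical weights for the four evidence sources (assumption: the
# abstract names the sources but gives no weights or combination rule).
WEIGHTS = {"parent": 0.3, "anchor": 0.3, "context": 0.2, "url": 0.2}


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def cosine(text, topic_terms):
    """Cosine similarity between a text's term counts and the topic term weights."""
    counts = Counter(tokenize(text))
    dot = sum(counts[t] * w for t, w in topic_terms.items())
    norm_text = math.sqrt(sum(c * c for c in counts.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_terms.values()))
    if norm_text == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_text * norm_topic)


def link_relevance(parent_text, anchor_text, link_context, url, topic_terms):
    """Weighted sum of the four per-source similarities (illustrative only)."""
    parts = {
        "parent": parent_text,
        "anchor": anchor_text,
        "context": link_context,
        "url": url,
    }
    return sum(WEIGHTS[k] * cosine(v, topic_terms) for k, v in parts.items())


def should_crawl(score, threshold=0.2):
    """Keep a link only if its predicted relevance reaches the threshold.

    The threshold value is an assumption chosen for illustration.
    """
    return score >= threshold


if __name__ == "__main__":
    # Stand-in for the CCG: a flat set of weighted topic terms.
    topic = {"crawler": 1.0, "lattice": 1.0}
    score = link_relevance(
        "focused crawler uses concept lattice",
        "crawler",
        "a crawler guide",
        "http://example.com/crawler/lattice.html",
        topic,
    )
    print(round(score, 3), should_crawl(score))
```

A real CCG would carry layered concept structure rather than a flat term list, so the `topic_terms` dictionary here only marks where that knowledge would plug in.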

 
