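Affiliation: [1] School of Computer Science and Technology, Donghua University, Shanghai 201620
Source: Computer Engineering and Design (《计算机工程与设计》), 2016, No. 10, pp. 2679-2684 (6 pages)
Funding: Fundamental Research Funds for the Central Universities (2232013D3)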
Abstract: To make full use of the semantic relationship between the crawl topic and pages yet to be visited, and to improve the overall performance of a focused crawler, this paper studies the concept context graph (CCG) focused crawling algorithm and proposes an improved version of it. The HITS algorithm is used to select high-quality topic background knowledge; a concept lattice model is built according to formal concept analysis theory; and the concept lattice is converted into a CCG that stores the user's query intent. The CCG is then used to predict the topical relevance of each link by combining the parent page, the anchor text, the link context, and the URL itself, so that irrelevant pages are filtered out. Experimental results show that the improved crawling algorithm effectively raises the precision and recall of page fetching and is highly feasible.
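The abstract names HITS as the step that selects high-quality background pages but gives no detail on how the authors apply it. As a rough illustration only, the standard HITS iteration can be sketched as below; the function name `hits`, the dictionary-based graph format, and the fixed iteration count are assumptions made for this sketch, not the paper's implementation.

```python
import math


def hits(graph, iterations=50):
    """Standard HITS power iteration (illustrative sketch).

    graph: dict mapping each page to the list of pages it links to
           (hypothetical input format chosen for this example).
    Returns (hubs, auths), two dicts of L2-normalized scores.
    """
    # Collect every page that appears as a source or a target.
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking to each page.
        new_auths = {p: 0.0 for p in pages}
        for p, targets in graph.items():
            for q in targets:
                new_auths[q] += hubs[p]

        # Hub update: sum the new authority scores of the pages each page links to.
        new_hubs = {
            p: sum(new_auths[q] for q in graph.get(p, [])) for p in pages
        }

        # Normalize both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auths.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hubs.values())) or 1.0
        auths = {p: v / a_norm for p, v in new_auths.items()}
        hubs = {p: v / h_norm for p, v in new_hubs.items()}

    return hubs, auths


# Tiny example: pages "a" and "b" both link to "c".
example_graph = {"a": ["c"], "b": ["c"], "c": []}
hubs, auths = hits(example_graph)
best_authority = max(auths, key=auths.get)
```

In a pipeline like the one the abstract describes, pages with high authority scores would presumably be the candidates kept as topic background knowledge; any selection threshold would be a further assumption.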
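Keywords: focused crawler; formal concept analysis; concept lattice; concept context graph; link prediction
Classification code: TP393 [Automation and Computer Technology: Computer Application Technology]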