Authors: 李莉 (LI Li)[1]; 刘淼 (LIU Miao); 冯嘉辉 (FENG Jia-hui) (School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022)
Affiliation: [1] School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022
Source: 《长春理工大学学报(自然科学版)》 Journal of Changchun University of Science and Technology (Natural Science Edition), 2020, No. 1, pp. 97-103 (7 pages)
Abstract: With the explosive growth of Internet information in recent years, general-purpose web crawlers have become an effective means of obtaining information, but their precision cannot be guaranteed. To address this problem, this paper proposes a focused crawler based on an improved BM25 algorithm and the SVM algorithm to overcome the shortcomings of general-purpose crawlers. The focused crawler consists of three parts: a web page crawling module, a web page preprocessing module, and a web page relevance evaluation module. The crawling module starts from a seed set of URLs and is responsible for fetching web page information. The preprocessing module uses the improved BM25 algorithm to extract topic feature vectors from the fetched pages. The relevance evaluation module uses the SVM algorithm to classify the topic feature vectors and so obtain the pages relevant to the user's search topic. Experimental results show that the proposed method achieves good precision in web page crawling.
Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]
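
The abstract does not say how the BM25 algorithm was improved, so the following is only a minimal sketch of the standard Okapi BM25 weighting, used here to turn a page into a topic feature vector of the kind the preprocessing module is described as producing. The function name `bm25_vector`, the parameters `k1` and `b`, the toy pages, and the topic terms are all assumptions made for illustration, not details from the paper.

```python
import math
from collections import Counter


def bm25_vector(doc_tokens, topic_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Return one standard Okapi BM25 weight per topic term for a single page.

    This is the textbook formula, not the paper's improved variant.
    """
    tf = Counter(doc_tokens)
    # Length normalisation: longer-than-average pages are penalised.
    length_norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
    vector = []
    for term in topic_terms:
        df = doc_freq.get(term, 0)
        # Smoothed inverse document frequency (always non-negative).
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        f = tf[term]
        vector.append(idf * f * (k1 + 1) / (f + length_norm))
    return vector


# Hypothetical toy corpus standing in for fetched, tokenised pages.
pages = [
    "focused crawler ranks pages by topic".split(),
    "svm classifier separates relevant pages".split(),
    "recipe for tomato soup".split(),
]
topic_terms = ["crawler", "svm", "pages"]  # assumed topic vocabulary

# Document frequency: in how many pages each term appears.
doc_freq = Counter(term for page in pages for term in set(page))
avg_len = sum(len(page) for page in pages) / len(pages)

vectors = [
    bm25_vector(page, topic_terms, doc_freq, len(pages), avg_len)
    for page in pages
]
```

Each entry of `vectors` has one weight per topic term, and a page that contains none of the topic terms gets an all-zero vector. In the pipeline the abstract describes, vectors like these would be passed to an SVM classifier trained on labelled relevant and irrelevant pages; that step is omitted here because the abstract gives no details about the kernel or the training data.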