Research on Key Technologies of Information Acquisition Based on Multivariate Data  Cited by: 4


Authors: LI Li [1], LIU Miao, FENG Jia-hui

Affiliation: [1] School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022

Source: Journal of Changchun University of Science and Technology (Natural Science Edition), 2020, No. 1, pp. 97-103 (7 pages)

Abstract: With the explosive growth of Internet information in recent years, general-purpose web crawlers have become an effective means of obtaining information, but their precision cannot be guaranteed. To address this problem, the paper proposes a focused crawler based on an improved BM25 algorithm and the SVM algorithm to overcome the shortcomings of general-purpose web crawlers. The focused crawler consists of three parts: a web page crawling module, a web page preprocessing module, and a web page relevance evaluation module. The crawling module starts from a seed set of URLs and is responsible for fetching web page information. The preprocessing module uses the improved BM25 algorithm to extract topic feature vectors from the page content. The relevance evaluation module uses the SVM algorithm to classify the topic feature vectors and thereby obtain the web pages relevant to the user's search topic. Experimental results show that the proposed method achieves good precision in web page crawling.
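The preprocessing step described in the abstract can be illustrated with a minimal sketch of BM25 term weighting. This record does not say how the paper modifies BM25, so the code below uses the textbook Okapi BM25 formula with a non-negative IDF, not the authors' improved variant. The function name `bm25_vectors`, the parameters `k1` and `b`, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_vectors(docs, vocabulary, k1=1.5, b=0.75):
    """Return one BM25-weighted feature vector per tokenized document.

    docs: list of non-empty token lists (one per web page).
    vocabulary: ordered list of topic terms; one vector component per term.
    This is standard Okapi BM25, NOT the paper's improved variant.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    # Non-negative IDF form: log(1 + (N - df + 0.5) / (df + 0.5)).
    idf = {
        t: math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        for t in vocabulary
    }

    vectors = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average pages are damped.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        vectors.append(
            [idf[t] * tf[t] * (k1 + 1) / (tf[t] + norm) for t in vocabulary]
        )
    return vectors


# Hypothetical tokenized pages and topic vocabulary.
pages = [
    ["svm", "crawler", "crawler"],
    ["cooking", "recipe"],
    ["crawler", "web"],
]
topic_terms = ["crawler", "svm", "recipe"]
features = bm25_vectors(pages, topic_terms)
```

Vectors of this kind could then be passed to an SVM classifier (for example `sklearn.svm.SVC`, if scikit-learn is available) trained on pages labeled relevant or irrelevant to the topic, which corresponds to the relevance evaluation module in the abstract.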

Keywords: focused crawler; BM25; SVM; vector space model

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
