Focused Crawler Based on Topic Boundary Around an Unvisited Link


Author: Zhang Huan (Department of Information Technology, Shandong Vocational College of Special Education, Jinan 250022, China)

Affiliation: [1] Department of Information Technology, Shandong Vocational College of Special Education, Jinan 250022

Source: Journal of Shandong Normal University (Natural Science), 2018, No. 4, pp. 421-426 (6 pages)

Abstract: Focused crawlers that rely on text content have two shortcomings: they introduce too many irrelevant feature attributes, and they ignore how feature attributes with different occurrence frequencies affect the relevance decision. To address both, this paper proposes a focused crawler based on the topic boundary text around a candidate (unvisited) link. The Dewey Decimal Classification (DDC) is used to extract the anchor text keywords together with those keywords in the page body whose meaning is close to the anchor text keywords; this set is called the topic boundary text around the candidate link. When a Naive Bayes classifier judges relevance, feature attributes are weighted according to their occurrence frequency, and the extracted candidate links are placed in a queue, ordered by relevance score, to await the next round of visits. Experimental results show that the crawler effectively improves the accuracy with which relevant web pages are retrieved.
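The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the weighting formula or the DDC lookup procedure, so the sketch assumes a standard multinomial Naive Bayes in which each feature's log-likelihood is multiplied by its occurrence count, and it replaces the DDC lexicon with a hypothetical toy dictionary (`TOY_DDC`). All names, URLs, and training data below are illustrative assumptions.

```python
import heapq
import math
from collections import Counter

# Hypothetical toy keyword -> DDC class mapping (an assumption for illustration;
# a real system would consult a full DDC-based lexicon).
TOY_DDC = {
    "crawler": "025.04", "search": "025.04", "indexing": "025.04",
    "football": "796.33", "league": "796.33",
}


def boundary_text(anchor_words, body_words, ddc=TOY_DDC):
    """Anchor keywords plus body keywords that share a DDC class with them."""
    anchor_classes = {ddc[w] for w in anchor_words if w in ddc}
    near = [w for w in body_words if ddc.get(w) in anchor_classes]
    return list(anchor_words) + near


class WeightedNaiveBayes:
    """Multinomial Naive Bayes: each feature's log-likelihood is scaled by its count."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for d in docs for w in d}
        self.prior, self.counts, self.totals = {}, {}, {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.prior[c] = math.log(len(class_docs) / len(docs))
            self.counts[c] = Counter(w for d in class_docs for w in d)
            self.totals[c] = sum(self.counts[c].values())
        return self

    def log_score(self, words, c):
        score = self.prior[c]
        for w, freq in Counter(words).items():
            if w not in self.vocab:
                continue  # ignore features never seen in training
            # Laplace-smoothed likelihood, weighted by occurrence frequency.
            p = (self.counts[c][w] + 1) / (self.totals[c] + len(self.vocab))
            score += freq * math.log(p)
        return score

    def relevance(self, words, positive="relevant"):
        """Posterior probability of the positive class (log-sum-exp normalized)."""
        logs = {c: self.log_score(words, c) for c in self.classes}
        m = max(logs.values())
        z = sum(math.exp(v - m) for v in logs.values())
        return math.exp(logs[positive] - m) / z


def rank_links(model, links):
    """Order candidate URLs by descending relevance, using a heap as the frontier."""
    heap = []
    for url, anchor, body in links:
        score = model.relevance(boundary_text(anchor, body))
        heapq.heappush(heap, (-score, url))  # negate: heapq is a min-heap
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


# Toy training data and candidate links (illustrative only).
docs = [["crawler", "search", "indexing"], ["search", "crawler"],
        ["football", "league"], ["league", "football", "football"]]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
model = WeightedNaiveBayes().fit(docs, labels)

links = [
    ("http://example.com/b", ["football"], ["league", "crawler"]),
    ("http://example.com/a", ["crawler"], ["search", "football", "indexing"]),
]
ranked = rank_links(model, links)
```

In this sketch the body keyword "football" is dropped from the second link's boundary text because it does not share a DDC class with the anchor keyword "crawler", which is the noise-reduction effect the abstract attributes to boundary text. A repeated keyword moves the score further than a single occurrence does, which is one plausible reading of weighting by frequency.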

Keywords: focused crawler; candidate link; Dewey Decimal Classification; Naive Bayes text classifier

Classification number: TP393 [Automation and Computer Technology: Computer Application Technology]

 
