Crawler with Dynamic Thesaurus and Improved Shark-Search Algorithm: Case Study of Military Equipment  (Cited by: 2)


Authors: Ding Shengchun [1]; Liu Kai [1]; Fang Zhen

Affiliation: [1] School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China

Source: Data Analysis and Knowledge Discovery, 2022, No. 8, pp. 52-60 (9 pages)

Funding: One of the research outcomes of a Jiangsu Provincial Social Science Fund project (Grant No. 20TQB004).

Abstract: [Objective] To address the low crawl rate and insufficient topic relevance that traditional topic crawlers often exhibit. [Methods] Building on the Shark-Search algorithm, this paper proposes Two-step Dynamic Shark-Search (TDSS), a two-step topic crawler algorithm with a dynamically extended topic thesaurus. TDSS splits the topic relevance calculation of the traditional algorithm into two separate steps: link topic relevance and page topic relevance. A topic thesaurus is built and extended from related materials and tools, and while the crawler runs, new keywords extracted from topic-relevant pages are added to the thesaurus to improve topic judgment. [Results] In the same experimental environment, the TDSS topic crawler achieved a crawl precision up to 14.2% higher and a collection efficiency up to 35% higher than the comparison algorithms. [Limitations] The dynamic topic-word extension algorithm needs further refinement; over-extending the topic thesaurus lowers crawl precision. [Conclusions] A TDSS-based topic crawler can effectively improve the accuracy of the topic information it acquires and crawl more topic-relevant webpages.
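The two-step idea in the abstract can be illustrated with a toy sketch. This is not the authors' TDSS implementation: the scoring function (share of tokens found in the thesaurus), the thresholds `link_min` and `page_min`, the `top_k` extension rule, and the in-memory `pages` dictionary standing in for the web are all assumptions made for illustration. A real crawler would also need fetching, Chinese word segmentation, stop-word filtering, and Shark-Search's score inheritance, none of which is shown here.

```python
import heapq
from collections import Counter

def relevance(text, thesaurus):
    """Toy score: fraction of tokens in `text` that appear in the thesaurus."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in thesaurus for t in tokens) / len(tokens)

def crawl(pages, seed, thesaurus, link_min=0.3, page_min=0.3, top_k=1):
    """pages maps url -> (page_text, [(anchor_text, target_url), ...])."""
    thesaurus = set(thesaurus)          # work on a copy; it grows during the crawl
    frontier = [(-1.0, seed)]           # max-heap emulated with negated scores
    seen, relevant = {seed}, []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = pages.get(url, ("", []))
        # Page topic relevance: decide whether the fetched page is kept.
        if relevance(text, thesaurus) < page_min:
            continue
        relevant.append(url)
        # Dynamic extension: add the most frequent unknown terms of a relevant page.
        new_terms = Counter(t for t in text.lower().split() if t not in thesaurus)
        thesaurus.update(term for term, _ in new_terms.most_common(top_k))
        # Link topic relevance: decide which outgoing links enter the frontier.
        for anchor, target in links:
            score = relevance(anchor, thesaurus)
            if target not in seen and score >= link_min:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return relevant, thesaurus

# Hypothetical three-page "web" used only to exercise the sketch.
pages = {
    "a": ("missile radar missile system",
          [("radar missile", "b"), ("cooking recipes", "c")]),
    "b": ("radar system system upgrade", [("missile test", "a")]),
    "c": ("pasta recipes", []),
}
relevant, vocab = crawl(pages, "a", {"missile", "radar"})
```

In this toy run the off-topic link is filtered by its anchor text before the page is ever visited, and the thesaurus picks up a term from the first relevant page. The `top_k` cap hints at the limitation noted in the abstract: extending the thesaurus too aggressively admits noisy terms and can lower precision.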

Keywords: topic crawler; Shark-Search; topic relevance; topic thesaurus

Classification: E91 [Military]; TP391 [Automation and Computer Technology: Computer Application Technology]

 
