Research on a Focused Crawler Based on an LDA-Extended Topic Thesaurus (Cited by: 13)

FOCUSED CRAWLER BASED ON LDA EXTENDED TOPIC TERMS


Authors: Fei Chenjie; Liu Baisong[1] (College of Information Science and Engineering, Ningbo University, Ningbo 315211, Zhejiang, China)

Affiliation: [1] College of Information Science and Engineering, Ningbo University, Ningbo 315211, Zhejiang, China

Source: Computer Applications and Software (《计算机应用与软件》), 2018, No. 4, pp. 49-54 (6 pages)

Funding: National Social Science Fund of China, Post-Funding Project (15FTQ002); provincial/ministerial laboratory open fund project (B2014)

摘  要:主题爬虫的目的在于尽可能准确地获取与特定主题相关的内容。针对主题爬虫主题覆盖率不足和主题相似度计算准确度偏低,提出一种动态主题的主题爬虫框架,对主题关键词进行两重扩展:用同主题的词扩展和词的语义扩展。利用主题爬虫自身主题相关资源收集的功能,不断对语料进行扩充,通过LDA训练得到主题文档来进行主题词库扩展更新。在此基础上,提出一种基于word2vec词向量表示的改进相似度计算模型,用于页面相似度计算和URL优先级排序。通过在真实新闻数据集上的实验表明,提出的爬虫在主题相关度的判断准确度和主题内容收获率上均有较好表现。The purpose of a focused crawler is to get as much content as possible related to a particular topic.In view of the lack of focused crawler coverage and the low accuracy of topic similarity calculation,this paper proposed a focused crawler framework with dynamic theme,which expanded the theme keywords in two ways:expansion of the word with the subject and semantic extension of the word.Using the functions of the subject crawler’s own related resources,we continuously expanded the corpus and obtain themed documents through LDA training to expand and update the thesaurus.On this basis,an improved similarity calculation model based on word2vec word vector was proposed for page similarity calculation and URL prioritization.Experiments on real news datasets showed that the focused crawler proposed in this paper all performed well on the accuracy of topic relevance and the yield of topic content.

Keywords: LDA topic model; focused crawler; word2vec; similarity computation

Classification: TP391.1 [Automation and Computer Technology — Computer Application Technology]

 
