Authors: Fei Chenjie; Liu Baisong [1]
Affiliation: [1] College of Information Science and Engineering, Ningbo University, Ningbo 315211, Zhejiang, China
Source: Computer Applications and Software (《计算机应用与软件》), 2018, No. 4, pp. 49-54 (6 pages)
Funding: National Social Science Fund of China, Post-funded Project (15FTQ002); Provincial/Ministerial Laboratory Open Fund Project (B2014)
Abstract: A focused crawler aims to retrieve content related to a specific topic as accurately as possible. To address the insufficient topic coverage of focused crawlers and the low accuracy of their topic similarity calculation, this paper proposes a focused crawler framework with a dynamic topic, which expands the topic keywords in two ways: expansion with words from the same topic, and semantic expansion of the words themselves. The framework uses the crawler's own collection of topic-relevant resources to continuously enlarge the corpus, and obtains topic documents through LDA training to expand and update the topic lexicon. On this basis, an improved similarity calculation model based on word2vec word vector representations is proposed for page similarity calculation and URL priority ranking. Experiments on a real news dataset show that the proposed crawler performs well in both the accuracy of topic relevance judgments and the harvest rate of topic content.
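Keywords: LDA topic model; focused crawler; word2vec; similarity calculation
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]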
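Illustrative sketch (not the authors' model): the abstract mentions a word2vec-based similarity used for page similarity calculation and URL priority ranking, but this record gives no formulas. The Python below shows only the general idea under loud assumptions: word vectors are averaged and compared by cosine similarity, and candidate URLs are ordered by that score. The names `TOY_VECTORS`, `topic_similarity`, and `prioritize`, the example URLs, and the 3-dimensional vectors are all hypothetical stand-ins for trained embeddings.

```python
import heapq
import math

# Hypothetical toy vectors standing in for trained word2vec embeddings.
TOY_VECTORS = {
    "crawler": [0.9, 0.1, 0.0],
    "spider": [0.8, 0.2, 0.1],
    "topic": [0.7, 0.3, 0.0],
    "football": [0.0, 0.1, 0.9],
    "match": [0.1, 0.0, 0.8],
}


def mean_vector(words, vectors):
    """Average the vectors of the known words; return None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_similarity(page_words, topic_words, vectors=TOY_VECTORS):
    """Assumed scoring rule: cosine similarity of the two averaged word vectors."""
    page_vec = mean_vector(page_words, vectors)
    topic_vec = mean_vector(topic_words, vectors)
    if page_vec is None or topic_vec is None:
        return 0.0
    return cosine(page_vec, topic_vec)


def prioritize(candidates, topic_words):
    """Order (url, anchor_words) pairs so the most topic-relevant URL comes first."""
    # heapq is a min-heap, so the score is negated to pop the highest score first.
    heap = [(-topic_similarity(words, topic_words), url) for url, words in candidates]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

With the topic words `["crawler", "topic"]`, a candidate whose anchor text is `["spider"]` would be ranked ahead of one whose anchor text is `["football", "match"]`. In a real system the toy dictionary would be replaced by embeddings trained on the crawled corpus.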