Affiliation: [1] School of Economics and Management, Xidian University, Xi'an, Shaanxi 710071
Source: 《情报理论与实践》 (Information Studies: Theory & Application), 2017, No. 12, pp. 123-127, 62 (6 pages in total)
Abstract: [Purpose/significance] A good focused search engine can better meet the information needs of users in a specialized field. [Method/process] During crawling, the system filters pages by topic using regular-expression matching on anchor text. It adds the IKAnalyzer Chinese word segmenter, combines the TF-IDF, OPIC and Topic-PageRank algorithms to improve the ranking of retrieval results, and clusters those results in real time with the STC algorithm. [Result/conclusion] In experiments with "Library and Information Science" as the topic, each additional distributed computing node raised the crawling rate by 20%, precision was 23% higher than without ranking optimization, the retrieval results could be clustered in real time and displayed visually, and most of the result items were relevant papers. [Limitations] The system does not fully parse the many data formats found in web pages, and the unparsed portions may contain topic-relevant content.
Classification number: TP391.3 [Automation and Computer Technology: Computer Application Technology]