Research on the Focused Search Engine Based on Suffix Tree Clustering (基于后缀树聚类的主题搜索引擎研究)  Cited by: 4


Authors: 韦美峰 (Wei Meifeng), 王亚民 (Wang Yamin)[1]

Affiliation: [1] School of Economics and Management, Xidian University, Xi'an, Shaanxi 710071, China

Source: 《情报理论与实践》 (Information Studies: Theory & Application), 2017, No. 12, pp. 123-127, 62 (6 pages in total)

Abstract: [Purpose/Significance] A good focused search engine can better meet the information needs of users in specialized domains. [Method/Process] During crawling, the system filters pages by topic using regular-expression matching on anchor text. It adds the IKAnalyzer Chinese word segmenter, combines the TF-IDF, OPIC, and Topic-PageRank algorithms to improve the ranking of retrieval results, and clusters the results in real time with the STC (suffix tree clustering) algorithm. [Results/Conclusions] In experiments on the topic "Library and Information Science", each additional distributed computing node raised the crawl rate by 20%, precision was 23% higher than without ranking optimization, the retrieval results could be clustered in real time and displayed visually, and most result items were relevant papers. [Limitations] The system does not sufficiently parse the many data formats found in web pages, and the unparsed portions may contain topic-relevant content.
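As an illustration of the anchor-text filtering step mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The pattern `TOPIC_PATTERN`, the class `AnchorCollector`, and the function `filter_links` are hypothetical names introduced here; the record does not list the actual regular expressions the system uses. The sketch keeps only those links whose anchor text matches a topic pattern, which is one plausible way a crawler could decide which URLs to follow.

```python
import re
from html.parser import HTMLParser

# Hypothetical topic pattern (assumption): the paper's real expressions are not given.
TOPIC_PATTERN = re.compile(r"图书|情报|library|information science", re.IGNORECASE)


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from the <a> elements of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def filter_links(html, pattern=TOPIC_PATTERN):
    """Return the hrefs whose anchor text matches the topic pattern."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [href for href, text in parser.links if pattern.search(text)]
```

A crawler built along these lines would enqueue only the URLs returned by `filter_links`, so off-topic pages are dropped before they are fetched; the trade-off is that relevant pages behind uninformative anchor text (for example "click here") are missed.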

Keywords: topic filtering; suffix tree clustering; search engine

Classification code: TP391.3 [Automation and Computer Technology: Computer Application Technology]

 
