一种基于动态平衡树的在线索引快速构建方法  被引量:5

A Fast On-Line Index Construction Method Based on Dynamic Balancing Tree

在线阅读下载全文

作  者:郭瑞杰[1,2] 程学旗[1] 许洪波[1] 王斌[1] 丁国栋[1] 

机构地区:[1]中国科学院计算技术研究所,北京100190 [2]中国科学院研究生院,北京100049

出  处:《计算机研究与发展》2008年第10期1769-1775,共7页Journal of Computer Research and Development

基  金:国家“八六三”高技术研究发展计划基金项目(2006AA010105,2007AA01Z416);国家“九七三”重点基础研究发展规划基金项目(2004CB318109)~~

摘  要:倒排索引的构建可以通过离线方式高效地完成,但是仅当整个数据集索引完毕后方可提供检索服务.在线索引可以在构建倒排索引的同时提供检索服务,新加入的文档即刻可供检索.提出了一种基于动态平衡树的在线索引更新策略,利用动态平衡树控制索引合并过程,使索引合并总是在大小相近的子索引之间进行,以减少索引合并代价,同时可以调节索引和检索之间的性能平衡.该方法提供了一个基于合并的在线索引更新框架,与已有方法相比具有更好的通用性、更高的性能和更好的规模可扩展性.在由4000万张网页构成的270GBWeb数据集上运行的实验表明,该方法在实际系统中是高效的,将索引更新的性能提高了92.28%,而检索性能仅下降4.79%,大幅度降低了在线索引构建的代价.Examined in this paper is the issues of on-line maintenance for growing text collections. This approach builds inverted index to reflect changes in the collection it describes, supporting efficient document insertions and deletions. A fast on-line index construction method based on dynamic balancing tree is proposed. This method takes a merge-based approach and aims to keep a good performance during indexing and query processing by always merging indices with similar sizes. This is achieved by using a tree in which each node corresponds to one sub-index. The placing of sub-indexes in the tree depends on the sizes of the sub-indexes, so that each merging takes place among similarly sized sub-indexes. The method is suited very well not only for growing text collections, but also for dynamic text collections, and is a uniform framework for merge-based index maintenance strategies. It is more general, faster and scales better than previous methods, and can be adapted to suit different balances of insertion and querying operations. Using experiments on large scale Web data, the efficiency, flexibility and scalability of this method are demonstrated in practice, showing that on-line index construction for growing text collections can be performed efficiently and almost as fast as for static text collections.

关 键 词:信息检索 在线索引 索引性能 检索性能 动态平衡树 

分 类 号:TP391.3[自动化与计算机技术—计算机应用技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象