RM树:一种支持字符串相似性操作的索引  被引量:6

RM-Tree: An Index Supporting String Similarity Operations

在线阅读下载全文

作  者:王金宝[1] 高宏[1] 李建中[1] 杨东华[2] 

机构地区:[1]哈尔滨工业大学计算机科学与技术学院,哈尔滨150001 [2]哈尔滨工业大学基础与交叉科学研究院高性能计算中心,哈尔滨150001

出  处:《计算机学报》2011年第11期2142-2154,共13页Chinese Journal of Computers

基  金:国家"九七三"重点基础研究发展规划项目基金(2012CB316200);国家自然科学基金(60903016;61003046;60533110;60773063;61173022);黑龙江省自然科学基金(F201031);中国博士后科学基金(20110491064);黑龙江省博士后基金(LBH-Z09140);哈工大科研创新基金"中央高校基本科研业务费专项资金"(HIT.NSRIF.2010060);哈工大优秀青年教师培养计划(HITQNJS2009.063)资助

摘  要:字符串相似性操作在很多领域中被广泛应用,如数据清洁、信息集成等.现有研究工作主要为基于q-Gram和倒排索引的内存方法,在处理大量数据时具有以下缺点:内存消耗大、更新效率低、支持操作类型有限.现有的外存索引Bed树无法将相似的字符串聚类,在查询处理过程中导致了较大的I/O代价.该文设计了支持多种字符串相似性操作的RM树索引,消除了现有内存方法的缺点,并通过字符串聚类的方法提高了相似性操作的效率.该文通过大量实验结果证明了RM树的有效性.String similarity processing is well adopted in various fields such as data cleaning,information integration,spelling check and bioinformatics.With the rapid growth of information,processing strings over massive datasets becomes a significant basic operation.Existing solutions based on q-Gram and inverted lists suffer from the following disadvantages: large storage cost,poor efficiency of updates and limited types of operation.At meanwhile,most of these methods are main memory based.Be d-tree is a recently proposed tree structured,disk resident index which supports diverse operations.Be d-tree fails to cluster similar strings together,thus incurs large I/O costs while string similarity processing.This paper proposes a disk resident index RM-tree which clusters similar strings together while eliminating the shortcomings of q-Gram and inverted list based solutions.RM-tree supports multiple types of string similarity operations such as range query,top-k query and string similarity join.This paper presents the construction method of RM-tree and its query processing algorithms.RM-tree is tested with extensive experiments including index construction and different types of query processing in terms of I/O cost and time consumption with two real world datasets and a synthetic dataset.Experiment results show that RM-tree is efficient for supporting string similarity operations,and provides better performance than Be d-tree.

关 键 词:字符串 相似性 索引 查询处理 连接处理 

分 类 号:TP311[自动化与计算机技术—计算机软件与理论]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象