RM树:一种支持字符串相似性操作的索引被引量：6

RM-Tree: An Index Supporting String Similarity Operations

机构地区：[1]哈尔滨工业大学计算机科学与技术学院,哈尔滨150001 [2]哈尔滨工业大学基础与交叉科学研究院高性能计算中心,哈尔滨150001

出　　处：《计算机学报》2011年第11期2142-2154,共13页Chinese Journal of Computers

基　　金：国家"九七三"重点基础研究发展规划项目基金(2012CB316200);国家自然科学基金(60903016;61003046;60533110;60773063;61173022);黑龙江省自然科学基金(F201031);中国博士后科学基金(20110491064);黑龙江省博士后基金(LBH-Z09140);哈工大科研创新基金"中央高校基本科研业务费专项资金"(HIT.NSRIF.2010060);哈工大优秀青年教师培养计划(HITQNJS2009.063)资助

摘　　要：字符串相似性操作在很多领域中被广泛应用,如数据清洁、信息集成等.现有研究工作主要为基于q-Gram和倒排索引的内存方法,在处理大量数据时具有以下缺点:内存消耗大、更新效率低、支持操作类型有限.现有的外存索引Bed树无法将相似的字符串聚类,在查询处理过程中导致了较大的I/O代价.该文设计了支持多种字符串相似性操作的RM树索引,消除了现有内存方法的缺点,并通过字符串聚类的方法提高了相似性操作的效率.该文通过大量实验结果证明了RM树的有效性.String similarity processing is well adopted in various fields such as data cleaning,information integration,spelling check and bioinformatics.With the rapid growth of information,processing strings over massive datasets becomes a significant basic operation.Existing solutions based on q-Gram and inverted lists suffer from the following disadvantages： large storage cost,poor efficiency of updates and limited types of operation.At meanwhile,most of these methods are main memory based.Be d-tree is a recently proposed tree structured,disk resident index which supports diverse operations.Be d-tree fails to cluster similar strings together,thus incurs large I/O costs while string similarity processing.This paper proposes a disk resident index RM-tree which clusters similar strings together while eliminating the shortcomings of q-Gram and inverted list based solutions.RM-tree supports multiple types of string similarity operations such as range query,top-k query and string similarity join.This paper presents the construction method of RM-tree and its query processing algorithms.RM-tree is tested with extensive experiments including index construction and different types of query processing in terms of I/O cost and time consumption with two real world datasets and a synthetic dataset.Experiment results show that RM-tree is efficient for supporting string similarity operations,and provides better performance than Be d-tree.

关键词：字符串相似性索引查询处理连接处理

分类号：TP311[自动化与计算机技术—计算机软件与理论]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

RM树:一种支持字符串相似性操作的索引被引量：6

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

RM树:一种支持字符串相似性操作的索引 被引量：6

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索

RM树:一种支持字符串相似性操作的索引被引量：6