检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:王金宝[1] 高宏[1] 李建中[1] 杨东华[2]
机构地区:[1]哈尔滨工业大学计算机科学与技术学院,哈尔滨150001 [2]哈尔滨工业大学基础与交叉科学研究院高性能计算中心,哈尔滨150001
出 处:《计算机学报》2011年第11期2142-2154,共13页Chinese Journal of Computers
基 金:国家"九七三"重点基础研究发展规划项目基金(2012CB316200);国家自然科学基金(60903016;61003046;60533110;60773063;61173022);黑龙江省自然科学基金(F201031);中国博士后科学基金(20110491064);黑龙江省博士后基金(LBH-Z09140);哈工大科研创新基金"中央高校基本科研业务费专项资金"(HIT.NSRIF.2010060);哈工大优秀青年教师培养计划(HITQNJS2009.063)资助
摘 要:字符串相似性操作在很多领域中被广泛应用,如数据清洁、信息集成等.现有研究工作主要为基于q-Gram和倒排索引的内存方法,在处理大量数据时具有以下缺点:内存消耗大、更新效率低、支持操作类型有限.现有的外存索引Bed树无法将相似的字符串聚类,在查询处理过程中导致了较大的I/O代价.该文设计了支持多种字符串相似性操作的RM树索引,消除了现有内存方法的缺点,并通过字符串聚类的方法提高了相似性操作的效率.该文通过大量实验结果证明了RM树的有效性.String similarity processing is well adopted in various fields such as data cleaning,information integration,spelling check and bioinformatics.With the rapid growth of information,processing strings over massive datasets becomes a significant basic operation.Existing solutions based on q-Gram and inverted lists suffer from the following disadvantages: large storage cost,poor efficiency of updates and limited types of operation.At meanwhile,most of these methods are main memory based.Be d-tree is a recently proposed tree structured,disk resident index which supports diverse operations.Be d-tree fails to cluster similar strings together,thus incurs large I/O costs while string similarity processing.This paper proposes a disk resident index RM-tree which clusters similar strings together while eliminating the shortcomings of q-Gram and inverted list based solutions.RM-tree supports multiple types of string similarity operations such as range query,top-k query and string similarity join.This paper presents the construction method of RM-tree and its query processing algorithms.RM-tree is tested with extensive experiments including index construction and different types of query processing in terms of I/O cost and time consumption with two real world datasets and a synthetic dataset.Experiment results show that RM-tree is efficient for supporting string similarity operations,and provides better performance than Be d-tree.
分 类 号:TP311[自动化与计算机技术—计算机软件与理论]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.145