Similarity Join Size Estimation with Threshold for Dirty Database

Cited by: 6

Source: Chinese Journal of Computers (计算机学报), 2012, No. 10, pp. 2159-2168 (10 pages)

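Funding: National Basic Research Program of China ("973" Program) (2012CB316200); National Natural Science Foundation of China (61003046; 61033015; 61133002); RSE-NSFC exchange project (61111130189); Doctoral Program Foundation of the Ministry of Education (20102302120054); Fundamental Research Funds for the Central Universities (HIT.NSRIF.2013064).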

Abstract: Dirty data is widespread in modern data management systems. It seriously degrades data quality and thereby reduces the utility and value of the data, which poses new challenges for data management. Many data models for managing dirty data have been proposed; one of them is the entity-based relational data model, in which each tuple represents one real-world entity. This model tolerates dirty data, provides a way to measure data quality, and can return query results that meet the quality level a user requires. Because of these features, traditional query cost estimation methods no longer apply, and new cost estimation techniques are needed. This paper studies how to estimate the result size of the join operation. It proposes clustering attribute values with Locality Sensitive Hashing (LSH) and then estimating by sampling, and it accounts for the effect of data quality on query results during clustering. Experimental results show that the proposed method gives more accurate estimates than traditional random sampling.
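The abstract describes the approach only at a high level, so the following is a minimal sketch of the general idea rather than the paper's algorithm. It assumes records are non-empty token sets, similarity is Jaccard, and LSH is done with a few min-hash functions. Pairs are split into a same-bucket stratum and a different-bucket stratum, each stratum is sampled separately, and the hit rates are scaled by the stratum sizes. All names (`estimate_join_size`, `num_hashes`, `samples_per_stratum`) are hypothetical, and the data-quality weighting mentioned in the abstract is not modeled here.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(tokens, num_hashes):
    """One min-hash per seed; crc32 keeps the hash stable across runs."""
    return tuple(
        min((zlib.crc32(f"{seed}:{t}".encode()) for t in tokens), default=0)
        for seed in range(num_hashes)
    )


def exact_join_size(records, threshold):
    """Brute-force self-join size, for checking the estimate on small inputs."""
    return sum(jaccard(a, b) >= threshold for a, b in combinations(records, 2))


def estimate_join_size(records, threshold, num_hashes=2,
                       samples_per_stratum=200, rng=None):
    """Estimate the number of record pairs with Jaccard >= threshold."""
    rng = rng or random.Random(0)
    n = len(records)
    sigs = [minhash_signature(r, num_hashes) for r in records]

    # LSH step: records with the same signature fall into the same bucket.
    buckets = defaultdict(list)
    for idx, sig in enumerate(sigs):
        buckets[sig].append(idx)

    big = [b for b in buckets.values() if len(b) >= 2]
    weights = [len(b) * (len(b) - 1) // 2 for b in big]
    same_pairs = sum(weights)                    # pairs sharing a bucket
    diff_pairs = n * (n - 1) // 2 - same_pairs   # all remaining pairs

    def rate_same():
        if same_pairs == 0:
            return 0.0
        hits = 0
        for _ in range(samples_per_stratum):
            # Pick a bucket proportionally to its pair count, then a pair in it.
            bucket = rng.choices(big, weights=weights)[0]
            i, j = rng.sample(bucket, 2)
            hits += jaccard(records[i], records[j]) >= threshold
        return hits / samples_per_stratum

    def rate_diff():
        if diff_pairs == 0:
            return 0.0
        hits = 0
        for _ in range(samples_per_stratum):
            # Rejection sampling: redraw until the pair spans two buckets.
            while True:
                i, j = rng.sample(range(n), 2)
                if sigs[i] != sigs[j]:
                    break
            hits += jaccard(records[i], records[j]) >= threshold
        return hits / samples_per_stratum

    return same_pairs * rate_same() + diff_pairs * rate_diff()


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}, {"x", "z"}]
    print("estimate:", estimate_join_size(data, 0.5))
    print("exact:   ", exact_join_size(data, 0.5))
```

Stratifying by bucket concentrates samples where matching pairs are likely, which is the usual motivation for combining LSH with sampling when the join is highly selective. Plain random sampling over all pairs would rarely hit a match in that case.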

Keywords: cost estimation; sampling-based estimation; dirty data; data quality; threshold

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]

