An Improved Data Quality Processing Algorithm for VLDB

An Improved Arithmetic for Data Quality in Very Large Database

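Affiliations: [1] Higher Vocational College, Shanghai University of Engineering Science, Shanghai 200437 [2] Shanghai Baosight Software, Shanghai 201203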

Source: Science and Technology Innovation Herald (《科技创新导报》), 2009, No. 2, pp. 43-45 (3 pages)

Abstract: Data quality is one of the most important problems enterprises face when building business intelligence systems. When handling data quality in a very large database (VLDB), identifying and consolidating approximately duplicate (fuzzy) records is particularly difficult. This article proposes an improved data quality processing algorithm for VLDBs. It first applies an improved clustering-based N-gram algorithm to detect approximately duplicate records, using pair-wise comparison to compute the degree of similarity between records. A fixed-size priority-queue window is then used to cluster the approximately duplicate records, and a transitive-closure criterion is introduced to produce a multi-pass clustering method that improves clustering accuracy. The authors report accuracy above 90% in both language identification and keyword detection.
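The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas or parameters, so the Dice coefficient over character trigrams, the `window` and `threshold` values, and all function names below are assumptions chosen for illustration. A plain sliding window of recent records stands in for the fixed-size priority queue, and a union-find structure supplies the transitive closure across passes.

```python
from collections import deque


def ngrams(text, n=3):
    """Return the set of character n-grams of a whitespace-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def similarity(a, b, n=3):
    """Pair-wise similarity as a Dice coefficient over n-gram sets (assumed measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


class UnionFind:
    """Disjoint sets; merging matched pairs yields their transitive closure."""

    def __init__(self, size):
        self.parent = list(range(size))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)


def cluster_duplicates(records, keys, window=4, threshold=0.6):
    """Run one windowed pass per sort key and merge all matches."""
    uf = UnionFind(len(records))
    for key in keys:  # each sort key is one pass of the multi-pass method
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        recent = deque(maxlen=window)  # simplified stand-in for the priority queue
        for i in order:
            for j in recent:
                if similarity(records[i], records[j]) >= threshold:
                    uf.union(i, j)
            recent.append(i)
    groups = {}
    for i in range(len(records)):
        groups.setdefault(uf.find(i), []).append(i)
    return sorted(groups.values())


records = ["john smith", "jon smith", "alice wong"]
print(cluster_duplicates(records, keys=[lambda r: r, lambda r: r[::-1]]))
# → [[0, 1], [2]]
```

Sorting on a second key (here the reversed string) gives records that were far apart in the first ordering a chance to fall into the same window, and the union-find merge ensures that a match found in any pass joins the corresponding clusters.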

Keywords: data quality; clustering; multi-pass method

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
