Algorithm for Detecting Approximately Duplicate Database Records Based on Conditional Probability Distribution (一种基于条件概率分布的近似重复记录检测方法)

Cited by: 3

Authors: Miao Jiajia (缪嘉嘉)[1], Wu Gang (吴刚)[1], Mao Handong (毛捍东)[2], Yang Qiang (杨强)[2], Deng Su (邓苏)[2]

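Affiliations: [1] School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073; [2] School of Humanities and Management, National University of Defense Technology, Changsha, Hunan 410073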

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2004, No. 12, pp. 2164-2168 (5 pages)

Funding: Supported by the National Natural Science Foundation of China (60103009)

Abstract: Data integration often produces approximately duplicate records, and detecting them is an active topic in data quality research. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records drawn from multiple databases, among other reasons. This paper investigates duplicate detection based on the structural features of records and presents an efficient dynamic clustering algorithm, based on conditional probability distribution, for recognizing clusters of approximately duplicate records. In deciding whether two records are approximately equivalent, the method addresses a weakness of earlier algorithms, which ignore the structural characteristics of sequences. The conditional probability distribution (CPD) of the next symbol given a preceding segment is derived and used to characterize sequence records and to define the distance between them. A variant of the suffix tree, the probabilistic suffix tree, organizes the CPD concisely. Following the nearest-neighbor function criterion, a criterion function is selected to evaluate the quality of the clustering results, and a dynamic clustering algorithm then clusters the sequence dataset. Comprehensive experiments on synthetic database records confirm the effectiveness of the new algorithm.
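
The idea of characterizing a record by the conditional probability of the next symbol given a preceding segment can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract does not give the paper's distance formula, so `cpd_distance` uses a symmetrized log-likelihood gap as a stand-in, and a plain dictionary keyed by context replaces the probabilistic suffix tree. The names `context_counts`, `cpd`, `avg_log_likelihood`, `cpd_distance`, `max_order`, and `alpha` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_counts(record, max_order=2):
    """Count next-symbol occurrences after every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for i, symbol in enumerate(record):
        for k in range(max_order + 1):
            if i - k < 0:
                break
            counts[record[i - k:i]][symbol] += 1
    return counts


def cpd(counts, context, alphabet, alpha=0.5):
    """Smoothed P(next symbol | context), backing off to the longest seen suffix."""
    while context and context not in counts:
        context = context[1:]  # drop the oldest symbol and retry
    seen = counts.get(context, Counter())
    total = sum(seen.values()) + alpha * len(alphabet)
    return {s: (seen[s] + alpha) / total for s in alphabet}


def avg_log_likelihood(record, counts, alphabet, max_order=2):
    """Average log-probability of `record` under the CPD stored in `counts`."""
    total = 0.0
    for i, symbol in enumerate(record):
        context = record[max(0, i - max_order):i]
        total += math.log(cpd(counts, context, alphabet)[symbol])
    return total / max(len(record), 1)


def cpd_distance(a, b, max_order=2):
    """Assumed distance: how much worse each record's CPD predicts the other
    record than that record's own CPD does, averaged over both directions."""
    alphabet = sorted(set(a) | set(b))
    ca = context_counts(a, max_order)
    cb = context_counts(b, max_order)
    gap_a = (avg_log_likelihood(a, ca, alphabet, max_order)
             - avg_log_likelihood(a, cb, alphabet, max_order))
    gap_b = (avg_log_likelihood(b, cb, alphabet, max_order)
             - avg_log_likelihood(b, ca, alphabet, max_order))
    return (gap_a + gap_b) / 2


# Illustrative records (made up for this sketch, not from the paper's data).
r1 = "john smith 12 main street"
r2 = "jon smith 12 main st"
r3 = "0987654321 zzz qqq www"
print(cpd_distance(r1, r2), cpd_distance(r1, r3))
```

Under this stand-in measure, two records that share most of their symbol contexts get a small distance, while records with little shared structure get a large one. A clustering step such as the dynamic clustering the abstract mentions would then operate on these pairwise distances.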

Keywords: information integration; approximately duplicate records; dynamic clustering; probabilistic suffix tree

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

