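Authors: Miao Jiajia (缪嘉嘉) [1], Wu Gang (吴刚) [1], Mao Handong (毛捍东) [2], Yang Qiang (杨强) [2], Deng Su (邓苏) [2]
Affiliations: [1] School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073; [2] School of Humanities and Management, National University of Defense Technology, Changsha, Hunan 410073
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2004, No. 12, pp. 2164-2168 (5 pages)
Funding: Supported by the National Natural Science Foundation of China (60103009)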
Abstract: Data integration often produces approximately duplicate records, and detecting such duplicates is an active topic in data quality research. Databases may contain duplicate records for the same real-world entity because of data entry errors, non-standardized abbreviations, or differences in the detailed schemas of records drawn from multiple databases, among other reasons. This paper investigates duplicate detection based on the structural features of records and presents an efficient dynamic clustering algorithm, based on conditional probability distributions, for recognizing clusters of approximately duplicate records. When judging whether two records are approximately equivalent, the method addresses a weakness of earlier algorithms, which ignore the structure of the sequence. The conditional probability distribution (CPD) of the next symbol given a preceding segment is derived and used to characterize each sequence record and to define the distance between records. A variation of the suffix tree, the probabilistic suffix tree, organizes the CPD concisely. Based on the near-neighbor function criterion, a criterion function is selected to evaluate the quality of the clustering result, and a dynamic clustering algorithm then clusters the sequence dataset. Experiments on synthetic database records confirm the effectiveness of the new algorithm.
Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
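
The abstract describes characterizing each record by the conditional probability of the next symbol given a preceding segment, then deriving a distance between records from it. The record does not give the paper's distance formula, so the following is only a minimal sketch of the general idea under stated assumptions: records are plain strings, contexts are capped at a hypothetical `max_order` symbols, probabilities use add-one smoothing, and the distance is a symmetric average negative log-likelihood. All function names are illustrative. A plain dictionary of counts stands in for the probabilistic suffix tree, and the clustering step is omitted.

```python
import math
from collections import defaultdict


def context_counts(record, max_order=2):
    """Count which symbol follows each context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, symbol in enumerate(record):
        for order in range(max_order + 1):
            if i - order < 0:
                break
            counts[record[i - order:i]][symbol] += 1
    return counts


def next_symbol_prob(counts, context, symbol, alphabet_size):
    """Estimate P(symbol | context), backing off to shorter contexts.

    Add-one smoothing is an assumption of this sketch, not the paper's method.
    """
    while context and context not in counts:
        context = context[1:]  # drop the oldest symbol of the context
    seen = counts.get(context, {})
    total = sum(seen.values())
    return (seen.get(symbol, 0) + 1) / (total + alphabet_size)


def avg_neg_log_likelihood(model, record, alphabet_size, max_order=2):
    """Average surprise of `record` under the CPD estimated from another record."""
    if not record:
        return 0.0
    total = 0.0
    for i, symbol in enumerate(record):
        context = record[max(0, i - max_order):i]
        total -= math.log(next_symbol_prob(model, context, symbol, alphabet_size))
    return total / len(record)


def cpd_distance(a, b, max_order=2):
    """Symmetric, assumed distance: how poorly each record predicts the other."""
    alphabet_size = len(set(a) | set(b)) or 1
    model_a = context_counts(a, max_order)
    model_b = context_counts(b, max_order)
    a_given_b = avg_neg_log_likelihood(model_b, a, alphabet_size, max_order)
    b_given_a = avg_neg_log_likelihood(model_a, b, alphabet_size, max_order)
    return (a_given_b + b_given_a) / 2


if __name__ == "__main__":
    original = "john smith 12 main st"
    near_duplicate = "jon smith 12 main st."
    unrelated = "zzz qqq 999 xyxy"
    print(cpd_distance(original, near_duplicate))
    print(cpd_distance(original, unrelated))
```

Under these assumptions a near-duplicate record scores a smaller distance than an unrelated one, which is the property a clustering step would rely on. The value is not zero for identical records, so it is a dissimilarity score and not a true metric.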