Approach for Detecting Approximately Duplicate Records Based on Clustering of Inner-Code Sequence Values (cited by: 8)

Affiliation: [1] School of Computer Science and Telecommunication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China

Source: Application Research of Computers (《计算机应用研究》), 2010, No. 3, pp. 874-878 (5 pages)

Funding: National Torch Program of China (2004EB33006[0]); Jiangsu Provincial Guiding Program for Natural Science Research in Universities (05JKD520050)

Abstract: Detecting and eliminating approximately duplicate records is one of the key problems in data cleaning and data quality improvement. To address it, this paper proposes a detection method based on clustering by inner-code sequence values. The method first selects a key field, or certain character positions within a field, and uses the inner-code sequence values of those characters to cluster a large dataset into many small datasets. It then computes a weight for each field with a rank-based method and applies these weights in the duplicate-detection algorithm. Finally, approximately duplicate records are detected and eliminated within each small dataset. Because a poorly chosen key can cause duplicate records to be missed, the method runs multiple detection passes. Experiments show that the method achieves good detection precision and time efficiency, applies to both Chinese and English character sets, is highly general, and handles approximately duplicate record detection effectively on large data volumes.
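The pipeline described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the paper's algorithm: the abstract does not give the formulas, so several choices here are assumptions. Unicode code points (`ord`) stand in for the inner-code sequence value, the rank-sum formula stands in for the rank-based weighting, `difflib.SequenceMatcher` stands in for the field similarity measure, and the 0.85 threshold, the function names, and the sample records are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def rank_sum_weights(ranks):
    """Assumed rank-based weighting: rank 1 is most important, weights sum to 1."""
    n = len(ranks)
    total = n * (n + 1) / 2
    return {field: (n + 1 - r) / total for field, r in ranks.items()}


def cluster_key(record, key_field, prefix_len=1):
    """Code points of the leading characters of the key field.

    Assumption: Unicode code points approximate the 'inner code' of a character.
    """
    return tuple(ord(ch) for ch in record[key_field][:prefix_len])


def record_similarity(r1, r2, weights):
    """Weighted sum of per-field string similarities (assumed measure)."""
    return sum(
        w * SequenceMatcher(None, r1[f], r2[f]).ratio()
        for f, w in weights.items()
    )


def detect_duplicates(records, key_fields, weights, threshold=0.85):
    """Return index pairs (i, j), i < j, judged approximately duplicate.

    One pass per key field; records are compared only within a cluster.
    """
    pairs = set()
    for key_field in key_fields:
        clusters = defaultdict(list)
        for idx, rec in enumerate(records):
            clusters[cluster_key(rec, key_field)].append(idx)
        for members in clusters.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    i, j = members[a], members[b]
                    if record_similarity(records[i], records[j], weights) >= threshold:
                        pairs.add((i, j))
    return pairs


# Hypothetical sample data mixing English and Chinese text.
records = [
    {"name": "Zhang Wei", "city": "Zhenjiang", "phone": "0511-8878001"},
    {"name": "Zhang Wey", "city": "Zhenjiang", "phone": "0511-8878001"},
    {"name": "李娜", "city": "南京", "phone": "025-83590000"},
    {"name": "Ahang Wei", "city": "Zhenjiang", "phone": "0511-8878001"},
]
weights = rank_sum_weights({"name": 1, "city": 2, "phone": 3})

single_pass = detect_duplicates(records, ["name"], weights)
multi_pass = detect_duplicates(records, ["name", "city"], weights)
```

In this sketch the record whose name begins with a mistyped first character falls into a different cluster when only `name` is used as the key, so the single pass misses it. A second pass keyed on `city` places it in the same cluster as its near-duplicates, which is the motivation the abstract gives for multi-pass detection.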

Keywords: approximately duplicate records; inner-code sequence value; clustering; rank-based method

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]
