A self-adaptive approach for information integration

Original title: 一种自适应信息集成方法. Cited by: 2

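Affiliations: [1] School of Information Engineering, Nanjing University of Finance and Economics; [2] Library, Nanjing University of Finance and Economics, Nanjing, Jiangsu 210003, China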

Source: 《计算机应用》 (Journal of Computer Applications), 2005, No. 3, pp. 666-669 (4 pages)

Abstract: Detecting approximately duplicate records (records that are similar but not identical) is one of the key tasks in information integration. Although various methods for detecting such records have been proposed, string matching lies at the core of all of them. The self-adaptive information integration algorithm proposed in this paper measures the similarity between strings with a hybrid similarity that combines edit distance and token distance. To avoid mismatches caused by differences in expression, each string is split into individual words, which are then sorted by their first character. Word matching tolerates misspellings and abbreviations to some degree. Experimental results show that the self-adaptive approach achieves higher accuracy than the Smith-Waterman and Jaro distances.
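The abstract does not give the exact formula for the hybrid similarity, so the following is only a minimal Python sketch of the general idea, under stated assumptions: words are compared with a normalized Levenshtein edit distance, a word that is a prefix of another is treated as its abbreviation, and the per-word scores are averaged over the longer word list. The names `word_similarity`, `hybrid_similarity`, and the `word_threshold` parameter are hypothetical and do not come from the paper.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_similarity(w1: str, w2: str) -> float:
    """Similarity in [0, 1] that tolerates misspellings and abbreviations."""
    if not w1 or not w2:
        return 0.0
    short, long_ = sorted((w1, w2), key=len)
    # Assumption: a prefix counts as an abbreviation of the longer word.
    if long_.startswith(short):
        return 1.0
    return 1.0 - edit_distance(w1, w2) / max(len(w1), len(w2))


def tokenize(s: str) -> list[str]:
    """Split into lowercase words, then sort them by their first character."""
    words = re.findall(r"[a-z0-9]+", s.lower())
    return sorted(words, key=lambda w: w[0])


def hybrid_similarity(s1: str, s2: str, word_threshold: float = 0.7) -> float:
    """Greedy word-level matching; the score is averaged over the longer list."""
    t1, t2 = tokenize(s1), tokenize(s2)
    if not t1 or not t2:
        return 0.0
    unused = list(t2)
    total = 0.0
    for w in t1:
        if not unused:
            break
        best = max(unused, key=lambda u: word_similarity(w, u))
        score = word_similarity(w, best)
        if score >= word_threshold:  # hypothetical cut-off for a word match
            total += score
            unused.remove(best)
    return total / max(len(t1), len(t2))


if __name__ == "__main__":
    print(hybrid_similarity("John Smith", "Smith John"))
    print(hybrid_similarity("Univ. of Nanjing", "Nanjing University"))
```

Because the words are sorted before matching, "John Smith" and "Smith John" compare as identical, which is the kind of expression difference the abstract aims to neutralize. The greedy matching and the threshold value are illustrative choices, not the paper's.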

Keywords: approximately duplicate records; hybrid similarity; self-adaptive; information integration; string matching

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
