Fast Semantic Duplicate Detection Techniques in Databases  

Fast Semantic Duplicate Detection Techniques in Databases

在线阅读下载全文

作  者:Ibrahim Moukouop Nguena Amolo-Makama Ophélie Carmen Richeline 

机构地区:[1]National Advanced School of Engineering, University of Yaounde I, Yaounde, Cameroon [2]Computer Sciences Department, University of Yaounde I, Yaounde, Cameroon

出  处:《Journal of Software Engineering and Applications》2017年第6期529-545,共17页软件工程与应用(英文)

摘  要:Semantic duplicates in databases represent today an important data quality challenge which leads to bad decisions. In large databases, we sometimes find ourselves with tens of thousands of duplicates, which necessitates an automatic deduplication. For this, it is necessary to detect duplicates, with a fairly reliable method to find as many duplicates as possible and powerful enough to run in a reasonable time. This paper proposes and compares on real data effective duplicates detection methods for automatic deduplication of files based on names, working with French texts or English texts, and the names of people or places, in Africa or in the West. After conducting a more complete classification of semantic duplicates than the usual classifications, we introduce several methods for detecting duplicates whose average complexity observed is less than O(2n). Through a simple model, we highlight a global efficacy rate, combining precision and recall. We propose a new metric distance between records, as well as rules for automatic duplicate detection. Analyses made on a database containing real data for an administration in Central Africa, and on a known standard database containing names of restaurants in the USA, have shown better results than those of known methods, with a lesser complexity.Semantic duplicates in databases represent today an important data quality challenge which leads to bad decisions. In large databases, we sometimes find ourselves with tens of thousands of duplicates, which necessitates an automatic deduplication. For this, it is necessary to detect duplicates, with a fairly reliable method to find as many duplicates as possible and powerful enough to run in a reasonable time. This paper proposes and compares on real data effective duplicates detection methods for automatic deduplication of files based on names, working with French texts or English texts, and the names of people or places, in Africa or in the West. After conducting a more complete classification of semantic duplicates than the usual classifications, we introduce several methods for detecting duplicates whose average complexity observed is less than O(2n). Through a simple model, we highlight a global efficacy rate, combining precision and recall. We propose a new metric distance between records, as well as rules for automatic duplicate detection. Analyses made on a database containing real data for an administration in Central Africa, and on a known standard database containing names of restaurants in the USA, have shown better results than those of known methods, with a lesser complexity.

关 键 词:SEMANTIC DUPLICATE DETECTION Technique DETECTION CAPABILITY Automatic DEDUPLICATION DETECTION Rates and Error Rates 

分 类 号:R73[医药卫生—肿瘤]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象