Simulation of Incremental Recognition Method of Multi-Data Source Approximate Repeat Record


Authors: MENG Fang; ZHAI Jian-li [1] (Huali College, Guangdong University of Technology, Guangzhou, Guangdong 511325, China)

Affiliation: [1] Huali College, Guangdong University of Technology, Guangzhou, Guangdong 511325, China

Source: Computer Simulation (《计算机仿真》), 2020, No. 8, pp. 362-365, 423 (5 pages)

Funding: Reform and Research of the Computer Simulation Discipline Based on an Open Virtual Laboratory (2015GXJK185).

Abstract: Data entry is prone to typing errors, and the same entity is often represented differently across data sources. Traditional incremental methods for recognizing approximate duplicate records across multiple data sources suffer from long execution times and low precision and recall. To address these problems, this paper proposes an incremental recognition method for approximate duplicate records from multiple data sources based on the MapReduce programming model. The basic sorted neighborhood method is used to sort the records in the data set by a chosen key; a fixed-size window is then moved over the sorted data set, and the records inside the window are compared to determine whether they match. The matching results are sorted and merged with the MapReduce programming model, and a jumping window is used to recognize duplicate records and obtain the final result. Experimental results show that the proposed method effectively reduces recognition time while maintaining the accuracy of duplicate record recognition.
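
The sorted neighborhood step named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `sorted_neighborhood`, the use of `difflib.SequenceMatcher` as the similarity measure, the window size, the 0.85 threshold, and the sample records are all assumptions made for illustration. The MapReduce merging and the jumping window are not shown.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Return index pairs (i, j), i < j, of records judged approximate duplicates.

    Hypothetical helper: the similarity measure and threshold are assumptions,
    not taken from the paper.
    """
    # Sort record indices by the chosen key so that similar records end up adjacent.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            score = SequenceMatcher(None, records[i], records[j]).ratio()
            if score >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Made-up sample records; the first and third differ only in "road" vs "rd".
records = [
    "zhang wei, 12 renmin road, guangzhou",
    "li na, 8 zhongshan avenue, shenzhen",
    "zhang wei, 12 renmin rd, guangzhou",
    "wang fang, 3 jiefang street, foshan",
]
duplicates = sorted_neighborhood(records, key=lambda r: r.lower())
print(duplicates)
```

Because each record is compared only with its neighbors inside the window, the number of comparisons grows roughly with the number of records times the window size rather than with the square of the number of records, which is the usual motivation for this family of methods.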

Keywords: multiple data sources; approximate duplicate records; incremental recognition method

Classification number: TP393 [Automation and Computer Technology: Computer Application Technology]

 
