Authors: MENG Fang (蒙芳); ZHAI Jian-li (翟建丽) [1] (Huali College, Guangdong University of Technology, Guangzhou, Guangdong 511325, China)
Source: Computer Simulation (《计算机仿真》), 2020, No. 8, pp. 362-365, 423 (5 pages)
Funding: Reform and Research of the Computer Simulation Discipline Based on an Open Virtual Laboratory (2015GXJK185).
Abstract: Data entry often introduces typing errors and inconsistent representations of the same record across data sources. Traditional incremental methods for recognizing approximately duplicate records across multiple data sources suffer from long execution times and low precision and recall. This paper proposes an incremental recognition method for approximately duplicate records across multiple data sources, based on the MapReduce programming model. The basic neighbor sorting (sorted-neighborhood) method sorts the records in the data set by a chosen key, moves a fixed-size window over the sorted data set, and checks whether the records inside the window match. The matching results are sorted and merged with the MapReduce programming model, and a jumping window is then used to recognize duplicate records and obtain the final result. Experimental results show that the proposed method saves recognition time while maintaining the accuracy of duplicate-record recognition.
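The sort-then-slide-a-window step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sorted_neighborhood`, the sort key, the window size, the similarity measure (`difflib.SequenceMatcher`), the threshold of 0.85, and the sample records are all assumptions chosen for illustration. The MapReduce merging and jumping-window stages are not shown.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def sorted_neighborhood(
    records: List[str],
    key: Callable[[str], str],
    window: int = 3,
    threshold: float = 0.85,
) -> List[Tuple[str, str]]:
    """Return pairs of records judged to be approximate duplicates.

    Records are sorted by `key`; each record is then compared only with
    the `window - 1` records that precede it in sorted order.
    """
    ordered = sorted(records, key=key)
    matches = []
    for i, current in enumerate(ordered):
        # Slice of the records that share a window with `current`.
        for previous in ordered[max(0, i - window + 1):i]:
            # Hypothetical similarity test; the paper's matching rule is not given here.
            if SequenceMatcher(None, previous, current).ratio() >= threshold:
                matches.append((previous, current))
    return matches


# Made-up sample data: the second "Wang Lei" record contains a typing error.
records = [
    "Wang Lei, Guangzhou",
    "Zhao Min, Guangzhou",
    "Wang Lei, Guangzhuo",
    "Li Wei, Beijing",
]
pairs = sorted_neighborhood(records, key=str.lower)
print(pairs)
```

Restricting comparisons to a window keeps the number of comparisons roughly proportional to the number of records times the window size, rather than comparing every pair. The cost is that duplicates whose keys sort far apart are missed.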
Classification: TP393 [Automation and Computer Technology: Computer Application Technology]