Research on a Similar Duplicate Record Detection Algorithm Based on Blocking and Sliding-Window Techniques  Cited by: 7

A DUPLICATE DETECTION APPROACH BASED ON BLOCKING AND WINDOWING

Authors: Chen Liang [1]; Du Lu; Hu Kang [1] (School of Computer Science, Xi'an Polytechnic University, Xi'an 710048, Shaanxi, China)

Affiliation: [1] School of Computer Science, Xi'an Polytechnic University, Xi'an 710048, Shaanxi, China

Source: Computer Applications and Software (《计算机应用与软件》), 2019, No. 4, pp. 262-267 (6 pages)

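Funding: Shaanxi Province Industrial Key Research Project (2014K05-43); Shaanxi Provincial Department of Education Special Research Project (14JK1310); Guangdong Provincial Key Laboratory of Computer Integrated Manufacturing (CIMSOF2016001)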

Abstract: Detecting similar duplicate records plays an important role in improving data quality. To reduce detection cost and improve running efficiency, we propose a duplicate detection algorithm based on the traditional windowing and blocking techniques. The algorithm sorts the data set by a key field and partitions it into blocks, then applies a sliding window to restrict which blocks are compared with each other. We also design an improved multi-key sorting algorithm that clusters the blocks produced by different key fields together, gives priority to block pairs with a high duplicate density, and discards blocks that cluster poorly. The improved algorithm reduces the number of record comparisons during detection and lessens the impact of key-field quality on the speed of the algorithm. Theoretical and experimental analyses show that it effectively improves both the accuracy and the time efficiency of duplicate detection.
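The sorting, blocking, and windowing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed `block_size`, the `window` measured in blocks, the `similar` helper built on `difflib.SequenceMatcher`, and the 0.8 threshold are all assumptions made for illustration, and the multi-key improvement (prioritizing block pairs by duplicate density) is not covered.

```python
from difflib import SequenceMatcher
from itertools import combinations, product


def similar(a, b, threshold=0.8):
    """Hypothetical matcher: whole-record string similarity (an assumption)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def detect_duplicates(records, key, block_size=2, window=2, threshold=0.8):
    """Return the set of index pairs (i, j), i < j, judged to be duplicates.

    records    : list of strings
    key        : function mapping a record to its sort key (the key field)
    block_size : consecutive sorted records per block (assumed fixed size)
    window     : number of consecutive blocks covered by the sliding window
    """
    # Step 1: sort record indices by the key field.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    # Step 2: partition the sorted order into blocks.
    blocks = [order[s:s + block_size] for s in range(0, len(order), block_size)]

    pairs = set()
    for b, block in enumerate(blocks):
        # Compare records inside the current block.
        candidates = list(combinations(block, 2))
        # Step 3: compare only with the next (window - 1) blocks.
        for other in blocks[b + 1:b + window]:
            candidates.extend(product(block, other))
        for i, j in candidates:
            if similar(records[i], records[j], threshold):
                pairs.add((min(i, j), max(i, j)))
    return pairs
```

Calling `detect_duplicates(records, key=lambda r: r)` sorts on the whole record; a larger `window` trades more comparisons for a lower chance of missing duplicates whose keys sort far apart.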

Keywords: data quality; similar duplicate record detection; windowing; blocking

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
