Time-series-oriented duplicate data cleaning algorithm optimization  Cited by: 2


Authors: Shen Pei; Mao Haitao [1]; Hu Wenlin; Rui Bo (Unit 92728 of PLA, Shanghai 200436, China; Hangzhou Mix Link Technology Co., Ltd.)

Affiliations: [1] Unit 92728 of PLA, Shanghai 200436, China; [2] Hangzhou Mix Link Technology Co., Ltd.

Source: Computer Era (《计算机时代》), 2022, No. 9, pp. 68-72, 77 (6 pages in total)

Abstract: An approximately duplicate data detection algorithm is proposed for massive time-series data sets. Building on the traditional sorted neighborhood method (SNM), the algorithm adds a dynamic adjustment strategy for the window size and a jump-sliding strategy for the window. The new strategies greatly reduce the number of comparisons performed while cleaning approximately duplicate data. The algorithm substantially improves the cleaning of approximately duplicate records in time-series data sets, especially data sets in which such records are sparse. Both theoretical analysis and experimental results indicate that the algorithm significantly improves the detection performance for approximately duplicate data.
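The comparison savings described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not state the actual window-adjustment or jump rules, so the rules below are assumptions. The record layout `(timestamp, value)`, the similarity test `is_similar`, the tolerances `dt` and `dv`, and the function names `snm_fixed` and `snm_adaptive` are all hypothetical. `snm_fixed` is the classic fixed-window SNM; `snm_adaptive` extends the window only while neighbours keep matching and then jumps past the matched run.

```python
from typing import List, Set, Tuple

Record = Tuple[float, float]  # (timestamp, value): an assumed record layout


def is_similar(a: Record, b: Record, dt: float = 1.0, dv: float = 0.01) -> bool:
    """Hypothetical similarity test: close in time and close in value."""
    return abs(a[0] - b[0]) <= dt and abs(a[1] - b[1]) <= dv


def snm_fixed(records: List[Record], w: int) -> Tuple[Set[Tuple[int, int]], int]:
    """Classic SNM: sort, then compare each record with the previous w-1 records."""
    recs = sorted(records)
    pairs: Set[Tuple[int, int]] = set()
    comparisons = 0
    for i in range(len(recs)):
        for j in range(max(0, i - w + 1), i):
            comparisons += 1
            if is_similar(recs[j], recs[i]):
                pairs.add((j, i))
    return pairs, comparisons


def snm_adaptive(records: List[Record], w_max: int) -> Tuple[Set[Tuple[int, int]], int]:
    """Illustrative variant (assumed rules, not taken from the paper).

    The window starts at the anchor's next neighbour and grows only while
    consecutive neighbours match the anchor, up to w_max records in total.
    The anchor then jumps to the first record that did not match.
    """
    recs = sorted(records)
    pairs: Set[Tuple[int, int]] = set()
    comparisons = 0
    n = len(recs)
    i = 0
    while i < n:
        j = i + 1
        while j < n and j - i < w_max:
            comparisons += 1
            if not is_similar(recs[i], recs[j]):
                break  # window stops growing at the first mismatch
            pairs.add((i, j))
            j += 1
        i = j  # jump past the run of records grouped with the anchor
    return pairs, comparisons
```

In this sketch a record with no similar neighbour costs one comparison instead of up to `w - 1`, which is why the saving is largest when duplicates are sparse. The trade-off of these assumed rules is that duplicates which are not adjacent after sorting can be missed, and pairs are reported relative to the anchor of each run, not for every pair within a run.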

Keywords: time-series data; improved SNM algorithm; approximately duplicate data; dynamic sliding window; data cleaning

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
