Authors: Shen Pei (沈沛), Mao Haitao (毛海涛)[1], Hu Wenlin (胡文林), Rui Bo (芮波) (Unit 92728 of PLA, Shanghai 200436, China; Hangzhou Mix Link Technology Co., Ltd.)
Affiliations: [1] Unit 92728 of PLA, Shanghai 200436, China; [2] Hangzhou Mix Link Technology Co., Ltd.
Source: Computer Era (《计算机时代》), 2022, No. 9, pp. 68-72, 77 (6 pages)
Abstract: An approximately duplicate data detection algorithm is proposed for massive time series data sets. Building on the traditional sorted neighborhood method (SNM), the algorithm adds a strategy that dynamically adjusts the window size and a strategy that lets the window jump as it slides. Together, these strategies greatly reduce the number of comparisons made while cleaning approximately duplicate data. The algorithm substantially improves the cleaning of approximately duplicate records in time series data sets, especially data sets in which such records are sparse. Both theoretical analysis and experimental results show that it significantly improves detection performance for approximately duplicate data.
Keywords: time series data; improved SNM algorithm; approximately duplicate data; dynamic sliding window; data cleaning
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
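
The abstract describes a sorted neighborhood pass whose window size changes as it slides. The following is a minimal Python sketch of that general idea, not the authors' implementation. The resizing rule (grow by one after an anchor with a match, shrink by one otherwise), the `difflib`-based similarity, the threshold, and the names `adaptive_snm`, `w_min` and `w_max` are all assumptions made for illustration. The window-jump strategy is left out because the abstract does not specify how it works.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for the paper's matching rule."""
    return SequenceMatcher(None, a, b).ratio()


def adaptive_snm(records, key=lambda r: r, threshold=0.85, w_min=2, w_max=8):
    """Sorted neighborhood pass with a window that grows and shrinks.

    Returns (pairs, comparisons), where pairs holds index pairs judged to be
    approximate duplicates. A window of size w covers the anchor record plus
    the w - 1 records that follow it in sorted order.

    Assumed resizing rule (hypothetical, not from the paper): after an anchor
    that matched something, grow the window by one; otherwise shrink it by one.
    """
    # Step 1: sort record indices by key so similar records end up adjacent.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    comparisons = 0
    w = w_min
    # Step 2: slide the window over the sorted order.
    for pos, i in enumerate(order):
        found = False
        for j in order[pos + 1 : pos + w]:
            comparisons += 1
            if similarity(key(records[i]), key(records[j])) >= threshold:
                pairs.append((min(i, j), max(i, j)))
                found = True
        # Step 3: resize the window for the next anchor.
        w = min(w + 1, w_max) if found else max(w - 1, w_min)
    return pairs, comparisons


# Made-up time series readings; the first two differ only in the last digit.
readings = [
    "2022-09-01 08:00:00 sensorA 20.5",
    "2022-09-01 08:00:00 sensorA 20.6",
    "2023-11-17 19:42:31 gaugeZZ 913.7",
    "2024-02-28 03:15:59 valveQ 0.004",
]
pairs, comparisons = adaptive_snm(readings)
print(pairs, comparisons)
```

In this sketch the window stays small wherever duplicates are sparse, so most anchors cost a single comparison. That is consistent with the abstract's claim that the gain is largest on data sets with sparse duplicates, although the paper's actual rule may differ.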