Research and Improvement of Data De-duplication Based on the Simhash Algorithm

Cited by: 15

Source: Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), 2016, No. 3, pp. 85-91 (7 pages)

Funding: Supported by the National Natural Science Foundation of China (11501302)

Abstract: To improve the accuracy of near-duplicate detection in large-scale document de-duplication, this paper studies de-duplication based on the Simhash algorithm in depth. The computation of the Simhash signature is improved on the basis of the original algorithm: ICTCLAS word segmentation is introduced, TF-IDF serves as the main method for computing term weights, and the part of speech and the word length of each feature are also taken into account as weighting factors. The resulting signatures are then compared by Hamming distance to determine accurately whether the compared items are similar. Experimental results show that the improved algorithm achieves high accuracy and recall, and that its detection performance is better overall than that of the Shingle algorithm and the original Simhash algorithm. Improving the accuracy of the signature value enables accurate detection of similar data in large-scale document collections and thus an effective de-duplication result.

Keywords: similarity detection; Simhash algorithm; TF-IDF; fingerprint computation; Hamming distance

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
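
The pipeline summarized in the abstract (weighted terms, a Simhash signature, then a Hamming-distance comparison) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the input is assumed to be already segmented into (word, part-of-speech) pairs, standing in for ICTCLAS output; the `POS_FACTOR` table, the word-length factor, the smoothed IDF formula, the use of MD5 as the per-term hash, and the threshold of 3 bits are all hypothetical choices, since the abstract does not state the values the authors used.

```python
import hashlib
import math
from collections import Counter

BITS = 64  # signature width; fixed to the 8 digest bytes used below

# Hypothetical part-of-speech multipliers (not taken from the paper).
POS_FACTOR = {"n": 1.5, "v": 1.2}


def term_weights(tagged_tokens, doc_freq, n_docs):
    """Weight each word by TF-IDF, scaled by part of speech and word length.

    tagged_tokens: list of (word, pos) pairs for one document.
    doc_freq: {word: number of corpus documents containing it}.
    n_docs: total number of documents in the corpus.
    """
    words = [word for word, _ in tagged_tokens]
    counts = Counter(words)
    pos_of = dict(tagged_tokens)
    weights = {}
    for word, count in counts.items():
        tf = count / len(words)
        # Smoothed IDF, always positive (an assumption, not the paper's formula).
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(word, 0))) + 1
        pos_factor = POS_FACTOR.get(pos_of[word], 1.0)
        length_factor = 1 + 0.1 * len(word)  # hypothetical length bonus
        weights[word] = tf * idf * pos_factor * length_factor
    return weights


def simhash(weights):
    """Fold weighted term hashes into one BITS-wide fingerprint."""
    totals = [0.0] * BITS
    for word, weight in weights.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        term_hash = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Add the weight where the term's bit is 1, subtract it where 0.
            totals[i] += weight if (term_hash >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, threshold=3):
    """Treat fingerprints within `threshold` bits as similar (assumed cutoff)."""
    return hamming_distance(a, b) <= threshold
```

A caller would compute `term_weights` for each document, pass the result to `simhash`, and compare pairs of fingerprints with `is_near_duplicate`. Changing the weighting only changes the `weights` dictionary, so the signature and comparison steps stay the same whichever weighting scheme is plugged in.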
