Affiliation: [1] School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China
Source: Journal of Nanjing University of Posts and Telecommunications: Natural Science Edition, 2016, No. 3, pp. 85-91 (7 pages)
Funding: Supported by the National Natural Science Foundation of China (11501302)
Abstract: To improve the accuracy of similar-data detection in large-scale document de-duplication, this paper studies large-scale document de-duplication based on the Simhash algorithm. The computation of the Simhash signature is improved on the basis of the original algorithm: ICTCLAS word segmentation is introduced, TF-IDF serves as the main method for computing weights, and the part of speech and word length of each feature are taken into account as additional weighting factors. The Hamming distances between the resulting signatures are then compared to determine accurately whether the compared items are similar. Experimental results show that the improved algorithm achieves high accuracy and recall, and that its overall detection performance is superior to that of the Shingle algorithm and the original Simhash algorithm. Improving the precision of the signature value thus enables accurate detection of similar data in large-scale document collections and achieves the desired de-duplication effect.
Keywords: similarity detection; Simhash algorithm; TF-IDF; fingerprint computation; Hamming distance
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
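
The pipeline summarized in the abstract (weight each feature, fold the weighted features into a fixed-length signature, compare signatures by Hamming distance) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the text has already been segmented into tokens (the paper uses ICTCLAS, which is not called here), uses a common smoothed TF-IDF form, and exposes the part-of-speech and word-length adjustments only as a hypothetical `factor` callback, because the record does not give their formulas. The function names, the MD5-based 64-bit hash, and the threshold of 3 bits are all illustrative assumptions.

```python
import hashlib
import math
from collections import Counter


def tf_idf_weights(tokens, doc_freq, n_docs, factor=None):
    """Return {term: weight} for one document using a smoothed TF-IDF.

    `factor` is a hypothetical hook (term -> float) standing in for the
    part-of-speech / word-length adjustments mentioned in the abstract;
    their actual formulas are not given in this record.
    """
    counts = Counter(tokens)
    total = len(tokens)
    weights = {}
    for term, count in counts.items():
        tf = count / total
        idf = math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0
        extra = factor(term) if factor is not None else 1.0
        weights[term] = tf * idf * extra
    return weights


def simhash(weights, bits=64):
    """Fold weighted terms into a `bits`-wide signature (bits <= 64 here)."""
    acc = [0.0] * bits
    for term, w in weights.items():
        # Assumption: first 8 bytes of MD5 as a 64-bit term hash.
        h = int.from_bytes(hashlib.md5(term.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            acc[i] += w if (h >> i) & 1 else -w
    sig = 0
    for i, v in enumerate(acc):
        if v > 0:
            sig |= 1 << i
    return sig


def hamming_distance(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, threshold=3):
    """Treat two signatures as similar if they differ in at most `threshold` bits."""
    return hamming_distance(a, b) <= threshold
```

In use, each document would be tokenized, passed through `tf_idf_weights` and `simhash`, and the resulting signatures compared pairwise with `is_near_duplicate`. The threshold is a tuning parameter and would need to be chosen empirically for a given corpus.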