Source: Journal of South China Normal University (Natural Science Edition), 2015, No. 1, pp. 121-126 (6 pages)
Funding: National Natural Science Foundation of China (61142012)
Abstract: As data in traditional relational databases grows, the likelihood of near-duplicate records increases sharply. This paper improves the Simhash algorithm by using the CityHash function to generate fingerprint feature values, which are then used to detect duplicate data. In experiments on real petition-handling data from a district government in Guangzhou, the approach achieved higher recall and precision than the other algorithms compared. An index classification method is also proposed to speed up one-pass similarity detection over the full data set; with the fingerprint values stored in MongoDB, it also supports efficient duplicate detection on incremental data. By improving the whole duplicate-detection process, the work is efficient and practical, and offers a reference for handling cases systematically and for identifying repeated cases.
Keywords: Simhash; CityHash; MongoDB; fingerprint feature value; similarity detection
Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
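For readers unfamiliar with Simhash, the general fingerprint-and-compare idea summarized in the abstract can be sketched as follows. This is a minimal illustration of textbook Simhash, not the paper's improved algorithm: the record gives no implementation details, so the character-bigram tokenizer, the 64-bit fingerprint width, the distance threshold of 3, and all function names are assumptions. CityHash is not in the Python standard library, so `hashlib.blake2b` truncated to 64 bits stands in for it here.

```python
import hashlib
from collections import Counter

FINGERPRINT_BITS = 64  # assumed width; the record does not state one


def token_hash(token: str) -> int:
    """Return a 64-bit hash of a token.

    Stand-in for CityHash64 (assumption): blake2b is used only because
    it ships with Python; any well-mixed 64-bit hash would do.
    """
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def shingles(text: str, width: int = 2) -> list:
    """Split text into overlapping character n-grams (hypothetical tokenizer)."""
    text = "".join(text.split())  # drop all whitespace
    if not text:
        return []
    if len(text) <= width:
        return [text]
    return [text[i:i + width] for i in range(len(text) - width + 1)]


def simhash(text: str) -> int:
    """Compute a Simhash fingerprint: a per-bit weighted vote over token hashes."""
    totals = [0] * FINGERPRINT_BITS
    for token, weight in Counter(shingles(text)).items():
        h = token_hash(token)
        for bit in range(FINGERPRINT_BITS):
            # A set bit votes +weight, a clear bit votes -weight.
            totals[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    """Treat two texts as duplicates if their fingerprints are close (assumed threshold)."""
    return hamming_distance(simhash(a), simhash(b)) <= max_distance
```

In a setup like the one the abstract describes, each record's fingerprint would be computed once and stored (for example as an integer field in a MongoDB document), so that a new incoming record only needs its own fingerprint computed before comparison against the stored values.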