检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
出 处:《计算机科学》2016年第6期254-256,311,共4页Computer Science
基 金:国家自然科学基金面上项目(61070240)资助
摘 要:在应用b位哈希函数近似计算两个集合的Jaccard相似性时,如果有多个元素与输入元素的Jaccard相似性都很高(接近于1),那么b位哈希函数不能对这些元素进行很好的区分。为了提高数据摘要函数的准确性并提高基于相似性的应用的性能,提出了一种基于数据摘要奇偶性的集合相似性近似算法。在应用minwise哈希函数得到两个变异集合后,用两个n位指示向量来表示变异集合中的元素在指示向量中出现的奇偶性,并基于这两个奇偶性向量来估计原集合间的Jaccard相似性。通过马尔科夫链和泊松分布两种模型对奇偶性数据摘要进行了推导,并证明了这两种方法的等价性。Enron数据集上的实验表明,提出的奇偶性数据摘要算法与传统的b位哈希函数相比具有更高的准确性,并且在重复文档检测和关联规则挖掘两种应用中具有更高的性能。Jaccard similarity is one of the most important methods in set similarity computation. When approximately computing the Jaccard similarity of two sets using the b-bits hash function, if there are multiple elements being similar to the input element with similarity up to 1,the b-bits hash function can't differentiate these elements very well. In or- der to improve the accuracy of data sketch and the application performance based on set similarity, this paper proposed a set similarity approximation algorithm based on parity of data sketch. After getting the two permutation sets with minwise hash function,we used two n-bits indicator vectors to represent the parity of elements in the permutation set appearing in indicator vectors, and estimated the Jaccard similarity of original sets based on these two parity vectors. We inferred the parity sketch based on both Markov chain and Poisson distribution models, and verified their equivalence. Experiments on Enron dataset show that the proposed parity sketch is more accurate than the b-bits hash function, and performs much better in both applications of duplicate document detection and associate rule mining.
分 类 号:TP391[自动化与计算机技术—计算机应用技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.15