基于数据摘要奇偶性的集合相似性近似算法  

Set Similarity Approximation Algorithm Based on Parity of Data Sketch

在线阅读下载全文

作  者:贾建伟[1] 陈崚[1] 

机构地区:[1]扬州大学信息工程学院,扬州225002

出  处:《计算机科学》2016年第6期254-256,311,共4页Computer Science

基  金:国家自然科学基金面上项目(61070240)资助

摘  要:在应用b位哈希函数近似计算两个集合的Jaccard相似性时,如果有多个元素与输入元素的Jaccard相似性都很高(接近于1),那么b位哈希函数不能对这些元素进行很好的区分。为了提高数据摘要函数的准确性并提高基于相似性的应用的性能,提出了一种基于数据摘要奇偶性的集合相似性近似算法。在应用minwise哈希函数得到两个变异集合后,用两个n位指示向量来表示变异集合中的元素在指示向量中出现的奇偶性,并基于这两个奇偶性向量来估计原集合间的Jaccard相似性。通过马尔科夫链和泊松分布两种模型对奇偶性数据摘要进行了推导,并证明了这两种方法的等价性。Enron数据集上的实验表明,提出的奇偶性数据摘要算法与传统的b位哈希函数相比具有更高的准确性,并且在重复文档检测和关联规则挖掘两种应用中具有更高的性能。Jaccard similarity is one of the most important methods in set similarity computation. When approximately computing the Jaccard similarity of two sets using the b-bits hash function, if there are multiple elements being similar to the input element with similarity up to 1,the b-bits hash function can't differentiate these elements very well. In or- der to improve the accuracy of data sketch and the application performance based on set similarity, this paper proposed a set similarity approximation algorithm based on parity of data sketch. After getting the two permutation sets with minwise hash function,we used two n-bits indicator vectors to represent the parity of elements in the permutation set appearing in indicator vectors, and estimated the Jaccard similarity of original sets based on these two parity vectors. We inferred the parity sketch based on both Markov chain and Poisson distribution models, and verified their equivalence. Experiments on Enron dataset show that the proposed parity sketch is more accurate than the b-bits hash function, and performs much better in both applications of duplicate document detection and associate rule mining.

关 键 词:数据摘要 集合相似性 奇偶性 近似算法 

分 类 号:TP391[自动化与计算机技术—计算机应用技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象