M2LSH:基于LSH的高维数据近似最近邻查找算法  被引量:5

M2LSH: An LSH Based Technique for Approximate Nearest Neighbor Searching on High Dimensional Data

在线阅读下载全文

作  者:李灿[1] 钱江波[1] 董一鸿[1] 陈华辉[1] 

机构地区:[1]宁波大学信息科学与工程学院,浙江宁波315211

出  处:《电子学报》2017年第6期1431-1442,共12页Acta Electronica Sinica

基  金:国家自然科学基金(No.61472194;No.61572266);浙江省自然科学基金(No.LY13F020040;No.LY16F020003);宁波市自然科学基金(No.2014A610023;No.2015A610119)

摘  要:在许多应用中,LSH(Locality Sensitive Hashing)以及各种变体,是解决近似最近邻问题的有效算法之一.虽然这些算法能够很好地处理分布比较均匀的高维数据,但从设计方案来看,都没有针对数据分布不均匀的情况做相应的优化.针对这一问题,本文提出了一种新的基于LSH的解决方案(M2LSH,2 Layers Merging LSH),对于数据分布不均匀的情况依然能得到一个比较好的查询效果.首先,将数据存放到具有计数功能的组合哈希向量表示的哈希桶中,然后通过二次哈希将这些桶号投影到一维空间,在此空间根据各个桶中存放的数据个数合并相邻哈希桶,使得新哈希桶中的数据量能够大致均衡.查询时仅访问有限个哈希桶,就能找到较优结果.本文给出了详细的理论分析,并通过实验验证了M2LSH的性能,不仅能减少访问时间,也可提高结果的正确率.The LSH (Locality Sensitive Hashing) method and its variants represent the state-of-the-art indexing tech- niques to process ANN (Approximate Nearest Neighbor) searches in a high dimensional data space. Although the existing LSH based techniques can efficiently deal with uniform distributed data, it is noticed from the principle of their design schemes that these techniques cannot handle non-uniform distributed data well. To tackle the above problem, this paper presents a new LSH based technique, called M2LSH (2 Layers Merging LSH), to efficiently process ANN searches on non-uniform distributed data. The key idea is as follows. First, data objects are stored in a hash table with multiple buckets (i. e. , the first layer) ,each of which corresponds to a combined hash vector used as its hash number. The count of data objects in each hash bucket is recorded. Secondly, using the hash functions for a second layer hash table, the bucket numbers from the first layer are projected into a one-dimensional space. In this space, some adjacent hash buckets from the first layer are merged so as to make the data objects uniformly distributed in the merged buckets at the second layer. Therefore, the M2LSH can not only improve the searching efficiency but also increase the accuracy of the search results. This paper also provides a detailed theoretical accuracy analysis for the M2LSH technique. Experiments using synthetic and real data show that the theoretical estimates are quite accurate, and the proposed M2LSH technique can efficiently process ANN searches with low false posi- tive and negative rates.

关 键 词:近似最近邻 KNN查询 局部敏感哈希 高维数据 

分 类 号:TP311.3[自动化与计算机技术—计算机软件与理论]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象