机构地区:[1]西安电子科技大学计算机科学与技术学院,西安710071 [2]香港中文大学系统工程与工程管理系,香港999077 [3]西安电子科技大学网络与信息安全学院,西安710071 [4]社会安全风险感知与防控大数据应用国家工程实验室,北京100084
出 处:《计算机学报》2020年第5期930-947,共18页Chinese Journal of Computers
基 金:国家自然科学基金(61976168,61702403,61672408,61972309);国家111计划(B16037);社会安全风险感知与防控大数据应用国家工程实验室主任基金项目;CCF-华为数据库创新研究计划项目(CCFHuaweiDBIR008B);中国博士后科学基金(2018M633473);陕西省自然科学基本研究计划(2015JQ6227,2018JM6073,2019CGXNG-023);江西省重点研发计划(20181ACE50029);中央高校基本科研业务基金(XJS190305,JB181505);西安电子科技大学研究生创新基金资助。
摘 要:针对外存环境中海量高维数据近似最近邻(Approximate Nearest Neighbor,ANN)查询面临的"维度灾难"和I/O性能瓶颈难题,本文提出了一种基于最优排序的局部敏感哈希(Locality-Sensitive Hashing,LSH)索引方案O2LSH(Optimal Order LSH).通过引入空间填充曲线为复合哈希键值建立线序并排序,使近邻候选点更多地分布在相同或相邻磁盘页面,实现用少量顺序I/O加载到足够多的候选点.本文对多种常用空间曲线技术进行了量化分析,发现:(1)基本排序方案SK-LSH使用的row-wise曲线具有"维度优先遍历"的特性,容易对ANN查询造成多种局限;(2)另一类"邻域优先遍历"特性的曲线能够产生更好的候选点局部分布,且排序性能更加稳定.通过对比,我们选取了一种最优的"邻域优先遍历"曲线构造线序,该线序能够最大程度地改善近邻候选点的局部分布,进一步提升磁盘访问效率和查询精度.在多个真实多媒体数据集上进行了对比实验,证实了O2LSH相对于先进LSH方案(包括C2LSH、SK-LSH、SRS以及QALSH)在查询精度和I/O效率上的优越性.特别地,O2LSH克服了基本排序方案SK-LSH对LSH关键参数的敏感性,算法实用性进一步提升.Nearest neighbor(NN)search in high-dimensional space is a fundamental paradigm in a wide range of applications,such as text information retrieval,search engine,content-based information query,duplication detection,etc.In these areas,data are usually large-scale and are modeled as high-dimensional features,which introduce two major problems to NN search,the“curse of dimensionality”and I/O performance bottleneck.On the one hand,the dimensionality curse makes exact NNs nearly infeasible to achieve.A lot of research efforts have been devoted to finding approximate nearest neighbors(ANN)which are close enough to the query to achieve a satisfying trade-off between accuracy and efficiency.On the other hand,large-scale feature sets are often too massive to fit into the internal memory,external storage(usually disk)becomes a reasonable choice.However,due to the huge speed gap between internal memory and external memory,the resulting input/output communication becomes very expensive,too much NN candidates or improper loading manner would make the NN candidates loading the most time-consuming part of the entire NN search,called the I/O performance bottleneck.LSH is a widely adopted technique due to its excellent error guarantee and high computational efficiency in tackling the“curse of dimensionality”.It can enable fast and accurate irrelevant points filtration and offers c-ANN results at a sub-linear time complexity which also makes it an attractive approach for disk-based ANN search.Plenty of LSH based methods have been developed to further boost the ANN performance.However,most of them access candidate objects through significant random I/O operations,which makes them tend to incur the I/O performance bottleneck.In this work,a novel Locality-Sensitive Hashing(LSH)index called Optimal Order LSH(O2LSH)is designed to further address the above two problems.First,O2LSH introduces space-filling curves to sort the compound LSH keys and rearrange the original data points accordingly.In this way,NN candidates can be store
关 键 词:近似最近邻 高维索引 局部敏感哈希 空间线序 局部分布
分 类 号:TP311[自动化与计算机技术—计算机软件与理论]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...