RPA:一种内存高效的度量空间recall@R近似最近邻搜索索引  

RPA:a memory-efficient metric-space recall@R ANNS index

在线阅读下载全文

作  者:江润本 陈家颖 毛睿 JIANG Runben;CHEN Jiaying;MAO Rui(College of Computer Science and Software Engineering,Guangdong Provincial Key Laboratory of Popular High Performance Computers,Guangdong Provincial Engineering Center of China-made High Performance Data Computing System,Shenzhen Key Laboratory of Service Computing and Applications,Shenzhen University,Shenzhen 518060,Guangdong Province,P.R.China)

机构地区:[1]深圳大学计算机与软件学院,广东省普及型高性能计算机重点实验室,广东省国产高性能数据计算系统工程技术研究中心,深圳市服务计算与应用重点实验室,广东深圳518060

出  处:《深圳大学学报(理工版)》2023年第6期640-648,共9页Journal of Shenzhen University(Science and Engineering)

基  金:National Natural Science Foundation of China (62072311)。

摘  要:现有的度量空间的近似最近邻搜索(approximate nearest neighbor search, ANNS)方法通常依赖于预选择的支撑点构成的序列,序列中的支撑点按照到数据元素的距离升序排列.然而,大多数现有的度量空间ANNS方法由于索引结构复杂、支撑点过多或者未能充分利用距离信息导致搜索时内存开销巨大.为此,提出精简排列阵(reduced permutation array, RPA)的度量空间recall@R近似最近邻搜索方法.对于全体数据元素,RPA预先选择k个支撑点,对每个数据元素仅存储离该数据元素最近的l个(l<<k),并将所有元素的支撑点序列构建为一个数组结构.在搜索过程中,利用一种得分函数,该函数基于查询对象到各个支撑点的距离来近似计算数据元素到查询对象的距离.同时,维护一个有界最小堆,以保存R个候选结果数据元素.RPA具有结构简单、内存效率高和可扩展性强等特点.实验结果表明,在相同召回率的情况下,与排列索引(permutation-based index, P-index)相比,RPA平均具有高达3倍的内存压缩比.研究结果可在内存资源有限的单机环境下提供一种有效的针对海量数据的ANNS方法.Approximate nearest neighbor search(ANNS)for high dimensional data has received extensive research efforts.Many existing ANNS methods in metric spaces are essentially based on the permutation of pivots,or pre-selected reference points,which are ordered by their distances to the data.Unfortunately,most of these permutation-based methods are quite memory demanding,because of either complexity of index structure,excessive pivots,or insufficient exploitation of distances.This paper proposes the reduced permutation array(RPA),a delicate integration of existing technical elements based on in-depth understanding,for recall@R ANNS in metric spaces.As an array,for each data element,RPA stores only l identities(IDs)of its nearest pivots out of k pre-selected ones(l≪k).During search,the order of distances from each data element to the query is approximated by a score function based on the distances from its nearest pivots to the query,and a bounded heap is maintained to store R candidate results.RPA is simple,memory-efficient,and scalable.Experiments show that RPA is approximately 3 times more memory-efficient than existing methods on average.The research results can provide an effective ANNS method for massive data in a single-machine environment with limited memory resources.

关 键 词:计算机科学与技术 近似最近邻搜索 度量空间 索引结构 支撑点选择 支撑点序列 内存高效 

分 类 号:TP301.6[自动化与计算机技术—计算机系统结构]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象