基于均衡聚类索引的近似最近邻检索方法

Balanced Clustering-based Index for Approximate Nearest Neighbor Retrieval

作　　者：吕宏伟李博刘普凡刘识李继伟刘俊健 LüHongwei;Li Bo;Liu Pufan;Liu Shi;Li Jiwei;Liu Junjian(Big Data Center of StateGrid Corporation of China,Nanjing 210023,China)

机构地区：[1]国家电网大数据中心,江苏南京210023

出　　处：《南京师大学报（自然科学版）》2024年第2期99-108,共10页Journal of Nanjing Normal University(Natural Science Edition)

基　　金：国家电网有限公司大数据中心自建科技项目(SGSJ0000SJJS2310021).

摘　　要：大数据时代,深度学习通过将复杂对象表示为高维特征向量,并使用向量之间的距离度量来衡量样本的相似性,在推荐系统、用户画像、数据中台管理等场景中得到了广泛的应用.但是,随着数据规模的不断增加,海量特征数据的相似向量检索面临着检索模型占用内容大、特征检索算法召回率较低的严重挑战.如何在保证检索精度的前提下,设计紧凑型索引图结构,降低特征检索的内存消耗,对于提升大数据系统的近邻检索效率具有重要的作用.因此,本文提出了一种均衡感知的快速K均值近邻聚类的特征数据分桶及其图结构紧凑型索引用于海量数据近邻检索.首先,设计了均衡感知的快速K-均值聚类算法,通过在图索引构建过程中海量特征数据的均衡分桶,将高维向量压缩成轻量级紧凑型图索引结构,随后通过量化操作进一步压缩高维向量样本,提升其在候选集上的最近邻检索速度.在基准数据集上实验验证结果表明,本文提出的方法能够在保证较高检测召回率的同时,有效加快索引构建速度,可以用于支持高维特征数据的高效最近邻检索.In the era of big data,deep learning has been widely applied in recommendation systems,user profiling,and data management by representing complex objects as high-dimensional feature vectors and evaluating their similarities based on vector distance measurements.However,with the continuous growth of data scale,the retrieval of similar feature vectors from massive data faces significant challenges such as large memory consumption of retrieval models and low recall rates of feature retrieval algorithms.It is crucial to design compact index graph structures and reduce memory consumption in feature retrieval to improve the efficiency of nearest neighbor search in large-scale data systems while ensuring retrieval accuracy.Therefore,a balanced-aware distributed K-means clustering-based user feature binning approach and a compact index design algorithm for graph structures are proposed.Firstly,fast balanced-aware K-means clustering algorithm is designed to achieve balanced binning of massive feature data during graph index construction,compressing high-dimensional vectors into lightweight and compact graph index structures.Subsequently,quantization operation is conducted to further compress high-dimensional vectors sample and improve its nearest neighbor search speed in dataset.Experimental results on benchmark datasets demonstrate that the proposed method can effectively accelerate index construction speed while ensuring high accuracy,thus enabling efficient indexing and retrieval of massive data.

关键词：大数据检索与分析最近邻搜索均衡感知

分类号：TP391[自动化与计算机技术—计算机应用技术]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

基于均衡聚类索引的近似最近邻检索方法

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

基于均衡聚类索引的近似最近邻检索方法

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索