Authors: Lü Hongwei, Li Bo, Liu Pufan, Liu Shi, Li Jiwei, Liu Junjian (Big Data Center of State Grid Corporation of China, Nanjing 210023, China)
Source: Journal of Nanjing Normal University (Natural Science Edition), 2024, No. 2, pp. 99-108 (10 pages)
Funding: Self-initiated science and technology project of the Big Data Center of State Grid Corporation of China (SGSJ0000SJJS2310021).
Abstract: In the era of big data, deep learning represents complex objects as high-dimensional feature vectors and measures sample similarity by the distance between those vectors. This approach is widely used in recommendation systems, user profiling, and data middle-platform management. As data volumes keep growing, however, similar-vector retrieval over massive feature data faces two serious challenges: retrieval models consume large amounts of memory, and feature retrieval algorithms achieve low recall. Designing a compact index graph structure that reduces the memory cost of feature retrieval without sacrificing retrieval accuracy is therefore important for improving nearest neighbor search efficiency in big data systems. This paper proposes a feature data bucketing method based on fast, balance-aware K-means clustering, together with a compact graph-structured index, for nearest neighbor search over massive data. First, a fast balance-aware K-means clustering algorithm is designed; by bucketing the massive feature data evenly during graph index construction, it compresses the high-dimensional vectors into a lightweight, compact graph index structure. A quantization step then further compresses the high-dimensional vector samples and speeds up nearest neighbor search over the candidate set. Experimental results on benchmark datasets show that the proposed method accelerates index construction while maintaining high recall, and can support efficient nearest neighbor search over high-dimensional feature data.
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
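The abstract describes bucketing feature vectors evenly with a balance-aware K-means step and then searching only within candidate buckets. The following is a minimal sketch of that general idea, not the authors' algorithm: the record gives no implementation details, so the greedy capacity-limited assignment, the function names `balanced_kmeans` and `search`, and the parameters `iters` and `nprobe` are all assumptions made for illustration. The graph index and the quantization step are omitted.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def balanced_kmeans(points, k, iters=10, seed=0):
    """Hypothetical capacity-limited K-means: no bucket exceeds ceil(n / k)."""
    rng = random.Random(seed)
    capacity = math.ceil(len(points) / k)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iters):
        # Rank every (point, bucket) pair by distance, closest first.
        pairs = sorted(
            (squared_distance(p, c), i, j)
            for i, p in enumerate(points)
            for j, c in enumerate(centroids)
        )
        sizes = [0] * k
        assigned = [False] * len(points)
        # Greedy assignment: take the closest pair whose bucket still has room.
        for _, i, j in pairs:
            if not assigned[i] and sizes[j] < capacity:
                assignment[i] = j
                assigned[i] = True
                sizes[j] += 1
        # Move each centroid to the mean of its current members.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if assignment[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


def search(query, points, centroids, assignment, nprobe=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(
        range(len(centroids)),
        key=lambda j: squared_distance(query, centroids[j]),
    )
    probe = set(order[:nprobe])
    candidates = [i for i, j in enumerate(assignment) if j in probe]
    # Returns the index of the nearest candidate, or None if no candidates.
    return min(
        candidates,
        key=lambda i: squared_distance(query, points[i]),
        default=None,
    )
```

Because every bucket holds at most `ceil(n / k)` vectors, the cost of scanning one bucket is bounded regardless of how skewed the data are, which is the motivation for balancing. Raising `nprobe` trades search speed for recall; with `nprobe` equal to `k` the search degenerates to an exact brute-force scan.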