检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:王伟迪 陈珂[1] 胡天磊[1] 陈刚[1] 寿黎但[1] WANG Weidi;CHEN Ke;HU Tianlei;CHEN Gang;SHOU Lidan(College of Computer Science and Technology,Zhejiang University,Hangzhou 310027,China)
机构地区:[1]浙江大学计算机科学与技术学院,杭州310027
出 处:《计算机科学与探索》2018年第11期1718-1728,共11页Journal of Frontiers of Computer Science and Technology
基 金:国家重点基础研究发展计划No.2015CB352400;国家自然科学基金Nos.61672455;61472348;浙江省基础公益研究项目No.LY18F020005~~
摘 要:单词移动距离(word mover.s distance,WMD)是最近提出的一种有效的文档相似性度量方式,其融合了Word2Vec词向量表达的语义信息,并依据推土机距离(earth mover.s distance,EMD)计算文档间的距离。然而,单词移动距离存在两个缺陷:第一点是它采用不够精确的词频来作为单词的权重;第二点是单词移动距离度量下的查询效率很低。为了改善应用单词移动距离时的效果,考虑到单词的重要性而采用TF-IDF(term frequency-inverse document frequency)评分作为单词权重,进而得到一种改进的单词移动距离(TI-WMD)。为了提高单词移动距离度量下的文档查询效率,提出了一种近似的层次化查询方法。首先,依据文档的单词质心向量采用局部敏感哈希为文档集合构建哈希索引。在查询过程中,依据查询文档的单词质心向量和多探寻局部敏感哈希方法获得候选文档集,接着依据文档标签与过滤-细化框架在候选文档集中获得TI-WMD度量下的近似k近邻。在Reuters-21578和20-Newsgroups两个文档数据集上的实验结果表明,相对于WMD与PrefetchPrune方法,TI-WMD与层次化查询在准确性和效率上更具优势。Recently,a good metric,word mover's distance(WMD)has been proposed to measure document similarity,which combines the Word2Vec model and computes the distance between two documents by earth mover's distance(EMD).However,it has two drawbacks.Firstly,it takes imprecise term frequency as word weight.Secondly,the existing query method is extremely inefficient under WMD metric.In order to increase the accuracy of document applications when using WMD,TF-IDF(term frequency-inverse document frequency)is taken as word weight and an improved WMD(TI-WMD)is achieved.To improve the efficiency of document query under WMD metric,this paper proposes an approximate hierarchical query mehod(HQ).First of all,according to word centroid vector of every document,it constructs the indexes for the document data set by locality-sensitive hashing(LSH).In the query process,it achieves a candidate document set according to the word centroid of the query document and multi-probe LSH query method.And then,on the basis of the document signature and the filter-and-refinement frame,it can obtain the approximate k nearest neighbor under TI-WMD metric.The empirical study over Reuters-21578 and 20-Newsgroups data sets demonstrates the superior accuracy and efficiency of TI-WMD metric and the hierarchical query method,compared with WMD and PrefetchPrune under WMD metric.
关 键 词:单词移动距离 推土机距离 局部敏感哈希 近似k近邻 层次化查询
分 类 号:TP311[自动化与计算机技术—计算机软件与理论]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.249