一种基于LSH面向二元混合类型数据的相似性查询方法  被引量:5

A LSH-Based Method for Similarity Queries on Binary Hybrid Data

在线阅读下载全文

作  者:朱命冬 申德荣 寇月 聂铁铮 于戈 ZHU MingDong;SHEN DeRong;KOU Yue;NIE TieZheng;YU Ge(College of Computer Science and Engineering,Northeastern University,Shenyang 110819;Department of Computer Science and Technology,Henan Institute of Technology,Xinxiang,Henan 453003)

机构地区:[1]东北大学计算机科学与工程学院,沈阳110819 [2]河南工学院计算机科学与技术系,河南新乡453003

出  处:《计算机学报》2018年第8期1827-1843,共17页Chinese Journal of Computers

基  金:国家"九七三"重点基础研究发展规划项目基金(2012CB316201);国家自然科学基金(61472070;61672142);河南省高等学校重点科研项目计划(18B520011)资助~~

摘  要:局部敏感哈希方法(LSH)已经被广泛用于高维数据和大规模数据集的最近邻查询,然而现有方法大多将LSH方法用于单一类型的数据,文中尝试将LSH方法用于二元混合类型数据,如图像-文本数据,空间-文本数据等.文中提出了一种基于LSH混合索引结构的相似性查询方法,该方法可有效地管理含两种数据类型的数据,并且融合两种数据类型的相似性进行最近邻查询.文中提出的查询方法主要有三个特点:首先,结合LSH方法为混合数据构建混合哈希值,该混合哈希值保留有数据对象之间内容相似性的信息,基于混合哈希值构建哈希索引,进行快速准确的最近邻查询;其次,该方法解决传统LSH方法固定敏感半径的问题,可以有效地处理可变查询范围的相似性查询;最后,该方法在分布式环境中不需要全局索引信息,保证分布式查询的伸缩性.文中通过理论分析证明了查询方法和查询算法的准确性和有效性,进一步通过分布式系统优化及基于真实数据和合成数据的大量实验验证了方法的伸缩性和高效性.Locality Sensitive Hashing (LSH) has been widely used in high-dimensional data and large - scale datasets for nearest neighbor queries, but LSH is mostly applied for data with a single data type, so this paper attempts to adapt LSH to binary hybrid data, such as data with image and text, data with location and text, etc. On the one hand, an effective index is difficult to be built on binary hybrid data, which are normally high-dimensional data consisting of different data structures. On the other hand, for different data types, distance functions are different. Such as Euclidean distance function for numeric data, Jaccard distance function for character data, etc., soflexible similarity queries are difficult to efficiently processed on binary hybrid data, which should consider similarity of two data types simultaneously. To this end, this paper proposes a generalized LSH based method for similarity queries on binary hybrid data. The method effectively manages data consisting of two different data types, and processes nearest neighbor queries by considering similarity of both data types. The method mainly has three features: Firstly, an LSH based hybrid hash value, which conserves similarity among data objects, is constructed for each data object, and then a hash index is built on hybrid hash values for processing nearest neighbor queries efficiently. Specifically, one single type hash value for each data type is generated by the corresponding LSH function, and then the hybrid hash value is built by concatenating two hash values of single data type. Intuitively similar objects have close hybrid hash values, so that it is with high probability that similar objects are stored in close hash buckets and can be found with few disk I/O in processing nearest neighbor queries. The lengths of the two hash values are judiciously chosen to make nearest neighbor query correct and effective. Secondly, the method overcomes the limit of traditional LSH based methods which have fixed sensitive query ranges, and effectively

关 键 词:二元混合数据 相似性查询 局部敏感哈希 分布式查询算法 最近邻查询 

分 类 号:TP391[自动化与计算机技术—计算机应用技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象