基于随机森林的哈希检索算法  被引量:5

Hashing Retrieval Method Based on Random Forest

在线阅读下载全文

作  者:花强[1] 郭欣欣 张峰[1] 董春茹[1] HUA Qiang;GUO Xinxin;ZHANG Feng;DONG Chunru(Key Laboratory of Machine Learning and Computational Intelligence of Hebei Province,Hebei University,Baoding,Hebei 071002,China)

机构地区:[1]河北大学河北省机器学习与计算智能重点实验室

出  处:《计算机科学与探索》2019年第7期1174-1183,共10页Journal of Frontiers of Computer Science and Technology

基  金:河北省自然科学基金面上项目Nos.F2018201115,F2018201096;河北省教育厅青年基金No.QN2017019;河北省教育厅科学技术研究重点项目No.ZD2019021~~

摘  要:从海量数据中进行近似数据的检索是数据挖掘领域许多应用的关键。尤其近年来,数据的规模出现爆炸式增长,数据检索需面对海量数据和“维度灾难”的叠加考验,这使得传统最近邻算法效率降低,而近似最近邻算法发挥了越来越重要的作用。其中哈希算法以其在存储空间和计算时间上的优势受到了广泛关注。提出了一种基于随机森林的哈希算法。该算法通过构建随机森林,将原始空间的样本映射为海明空间的二进制哈希码,并在哈希空间上定义了顺序敏感的海明距离,以最大程度保持数据在原空间的近邻关系不变。由于随机森林中不同决策树所使用的特征空间和学习过程是独立的,可以以增量的方式灵活地确定哈希码的长度。此外基于随机森林的哈希编码算法天然适合并行部署,从而可以大大提高算法速度。最后,在MNIST和CIFAR-10数据集对所提算法进行了实验验证,结果表明了算法的有效性和出色性能。The retrieval of approximate data from massive data is the key to many applications in data mining. Especially in the recent years, the scale of data has exploded, and data retrieval needs to face the overlapping test of massive data and“dimension disaster”, which makes the efficiency of traditional nearest neighbor algorithm decrease, while the approximate nearest neighbor algorithms are playing an increasingly important role. Among them, the Hashing algorithm has received extensive attention due to its advantages in storage space and computation time. This paper proposes a Hashing algorithm based on random forest. In the method, by constructing a random forest, the sample of the original space is mapped to the binary Hash code of Hamming space, and an order- sensitive Hamming distance is defined to keep the neighbor relationship of the data in the original space to the maximum extent. Since the feature space and learning process used by different decision trees in a random forest are independent, the length of the Hash code can be flexibly determined in an incremental manner. In addition, the proposed algorithm based on random forest is naturally suitable for parallel deployment, which can greatly improve the speed of the algorithm. Finally, the proposed algorithm is experimentally verified in the MNIST and CIFAR-10 data sets. The results demonstrate the effectiveness and great performance of the algorithm.

关 键 词:近似近邻检索(ANNS) 哈希编码 随机森林 顺序敏感的海明距离 

分 类 号:TP181[自动化与计算机技术—控制理论与控制工程]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象