DisITQ: A Distributed Iterative Quantization Hashing Learning Algorithm

Qun Chen, Bo Lang, Xianglong Liu, Zepeng Gu

Published: 2016, Last Modified: 13 Nov 2024, ISCID (2) 2016, CC BY-SA 4.0

Abstract: In the field of big data retrieval, hashing-based approximate nearest neighbor (ANN) search has attracted considerable attention. However, most existing hashing algorithms are learned in centralized settings on small-scale datasets; in other words, they are single-machine approaches that load the training data into memory to build their models. For big data processing, models learned from large-scale datasets, which exhibit big-data properties such as variety, often perform better. However, two critical problems arise when the training datasets are very large. First, a single compute node cannot load all the data into memory to train hashing models. Second, in real-world applications, the data is often stored or even collected in a distributed manner, and it is infeasible to gather all of it into a fusion center because of the prohibitively expensive communication and computation overhead. In this article, we present a distributed learning algorithm based on MapReduce and Iterative Quantization (ITQ) to train hashing functions. The proposed method, named distributed iterative quantization hashing (DisITQ), can not only be run on large-scale datasets but can also be applied to scenarios where the data is stored in a distributed manner. Extensive experiments on large-scale datasets demonstrate the time efficiency and accuracy advantages of the proposed method in comparison with state-of-the-art hashing algorithms.
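
The abstract does not spell out how DisITQ splits the computation across nodes. As a rough illustration only, and not the authors' method, the sketch below shows one way standard ITQ could be arranged in a map/reduce pattern: each partition emits small sufficient statistics (moments for PCA, then a bits-by-bits matrix for the rotation update), and a driver sums them and solves the orthogonal Procrustes step. All function names (`map_moments`, `map_procrustes`, `train_itq`, `encode`) and parameters are hypothetical, and the "partitions" are plain in-memory NumPy arrays standing in for distributed data blocks.

```python
import numpy as np
from functools import reduce


def map_moments(X):
    # Map step (assumed): per-partition statistics for PCA.
    # Returns row count, column sums, and the Gram matrix X^T X.
    return X.shape[0], X.sum(axis=0), X.T @ X


def add_tuples(a, b):
    # Reduce step: element-wise sum of two statistic tuples.
    return tuple(x + y for x, y in zip(a, b))


def map_procrustes(X, mean, W, R):
    # Map step (assumed): project the partition, binarize with the
    # current rotation R, and emit the small matrix V^T B.
    V = (X - mean) @ W
    B = np.where(V @ R >= 0, 1.0, -1.0)
    return V.T @ B


def train_itq(partitions, n_bits, n_iter=20, seed=0):
    # Global mean and covariance from the summed partition statistics.
    n, s, G = reduce(add_tuples, (map_moments(X) for X in partitions))
    mean = s / n
    cov = G / n - np.outer(mean, mean)

    # PCA: keep the n_bits leading eigenvectors (eigh sorts ascending).
    _, eigvecs = np.linalg.eigh(cov)
    W = eigvecs[:, ::-1][:, :n_bits]

    # Random orthogonal initialization of the rotation.
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))

    for _ in range(n_iter):
        # Reduce step: sum the per-partition V^T B matrices.
        M = sum(map_procrustes(X, mean, W, R) for X in partitions)
        # Orthogonal Procrustes update: R = U V^T from the SVD of M.
        U, _, Vt = np.linalg.svd(M)
        R = U @ Vt
    return mean, W, R


def encode(X, mean, W, R):
    # Binary codes (0/1) for the rows of X.
    return ((X - mean) @ W @ R >= 0).astype(np.uint8)
```

In this arrangement only matrices of size d-by-d or bits-by-bits travel between nodes per iteration, never the raw data, which is the kind of property a distributed scheme would need; whether DisITQ uses this exact decomposition is an assumption.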