A comparison of Hadoop, Spark and Storm for the task of large scale image classification

Published: 01 Jan 2018, Last Modified: 07 May 2025, SIU 2018, CC BY-SA 4.0
Abstract: An image-based retrieval system (IRS) finds the images most relevant to a query among all the images in a database. When the size of the database exceeds the storage capacity of a single machine, conventional systems are not adequate. Hadoop provides a solution by distributing data and processing across available commodity hardware. It works well for batch processing and when low latency is not required. When online processing and low latency are needed, Spark and Storm offer solutions. In this paper, we perform two comparisons of the Hadoop, Spark and Storm frameworks. In the first, Hadoop MapReduce (M/R) and Spark are analysed for the image indexing task. The results show that Hadoop M/R performs better than Spark when there are no iterative operations on the data (e.g., indexing), since no intermediate disk writes are needed. On the other hand, Spark performs better for iterative operations (e.g., Word Count). In the second comparison, Spark and Storm are compared for the task of classification. Storm yields lower latency than Spark, while both frameworks remain quite stable as the number of queries increases. This analysis could be useful for researchers and developers of distributed image processing systems.