Realizing Efficient On-Device Language-based Image Retrieval

Published: 2024 · Last Modified: 06 Mar 2025 · ACM Trans. Multim. Comput. Commun. Appl. 2024 · CC BY-SA 4.0
Abstract: Advances in deep learning have enabled accurate language-based search and retrieval (e.g., over user photos) in the cloud. However, many users prefer to store their photos at home due to privacy concerns, creating a need for models that can perform cross-modal search on resource-limited devices. State-of-the-art (SOTA) cross-modal retrieval models achieve high accuracy by learning entangled representations that enable fine-grained similarity calculation between a language query and an image, but at the expense of prohibitively high retrieval latency. Alternatively, a newer class of methods offers good performance at low latency but requires far more computational resources and an order of magnitude more training data (i.e., large web-scraped datasets of millions of image–caption pairs), making it infeasible to use in a commercial context. From a pragmatic perspective, none of the existing methods is suitable for developing commercial applications for low-latency cross-modal retrieval on low-resource devices. We propose CrispSearch, a cascaded approach that greatly reduces retrieval latency with minimal loss in ranking accuracy for on-device language-based image retrieval. The idea behind our approach is to combine a lightweight, runtime-efficient coarse model with a fine re-ranking stage. Given a language query, the coarse model filters out most of the irrelevant image candidates; only a handful of strong candidates are then selected and sent to a fine model for re-ranking. Extensive experimental results with two SOTA models for the fine re-ranking stage on standard benchmark datasets show that CrispSearch achieves a speedup of up to 38 times over the SOTA fine methods with negligible performance degradation. Moreover, our method does not require millions of training instances, making it a pragmatic solution to on-device search and retrieval.
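The cascade described in the abstract, a cheap coarse scorer over the full collection followed by an expensive fine re-ranker over the top-k survivors, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `coarse_score` and `fine_score` are hypothetical stand-ins for the lightweight model and the entangled SOTA re-ranker, and the vectors stand in for precomputed image and query embeddings.

```python
def cascaded_search(query_vec, image_vecs, coarse_score, fine_score, k=10):
    """Two-stage retrieval sketch: filter with a cheap scorer, then
    re-rank only the k surviving candidates with an expensive one."""
    # Coarse stage: score every image with the cheap model, keep the top-k.
    candidates = sorted(
        range(len(image_vecs)),
        key=lambda i: coarse_score(query_vec, image_vecs[i]),
        reverse=True,
    )[:k]
    # Fine stage: the expensive model runs only k times, not once per image,
    # which is where the latency reduction comes from.
    return sorted(
        candidates,
        key=lambda i: fine_score(query_vec, image_vecs[i]),
        reverse=True,
    )
```

Because the fine model is invoked only k times regardless of collection size, overall latency is dominated by the coarse pass, while ranking quality among the survivors is determined by the fine model.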