Ensemble Learned Bloom Filters: Two Oracles are Better than One

Published: 01 May 2025, Last Modified: 18 Jun 2025, Venue: ICML 2025 poster, License: CC BY 4.0
Abstract: Bloom filters (BF) are space-efficient probabilistic data structures for approximate membership testing. Boosted by the proliferation of machine learning, learned Bloom filters (LBF) were recently proposed; they augment the canonical BF with a learned oracle as a pre-filter, the size of which is crucial to the compactness of the overall system. In this paper, inspired by ensemble learning, we depart from the state-of-the-art single-oracle LBF structure by demonstrating that, by leveraging multiple learning oracles of smaller size and carefully optimizing the accompanying backup filters, we can significantly boost the performance of LBF under the same space budget. We then design and optimize ensemble learned Bloom filters for mutually independent and correlated learning oracles respectively. We also empirically demonstrate the performance improvement of our proposals on three practical data analysis tasks.
Lay Summary: Bloom filters are efficient tools for checking whether an item is part of a set, but they can sometimes give false positives. Recently, learned Bloom filters were introduced, using a machine learning model to improve accuracy. However, the model can become large and inefficient, limiting their practical use. Our research proposes using multiple smaller machine learning models instead of one large one. By combining these smaller models, we can significantly reduce the false positive rate while keeping the system compact and efficient. We develop algorithms to optimize the selection and combination of these models, ensuring the best performance under a given memory budget. Our experiments demonstrate that our approach outperforms existing methods in terms of accuracy and efficiency. Our approach can be applied to various real-world tasks, such as detecting malicious URLs or scanning virus signatures, making these processes faster and more reliable. By improving the accuracy of membership checks, we can enhance the security and efficiency of many applications that rely on Bloom filters.
Primary Area: Optimization
Keywords: Learned Bloom Filters, membership testing, learning-augmented algorithms, combinatorial optimization
Submission Number: 1750
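
To make the structure described in the abstract concrete, the following is a minimal Python sketch of a learned Bloom filter with several oracles in front of a backup filter. It is an illustration under stated assumptions, not the paper's design: the class names, the toy oracles `has_login_token` and `has_odd_tld`, the thresholds, and the combination rule (a key is accepted only if every oracle scores it at or above its threshold) are all hypothetical, and no optimization of oracle selection or backup-filter sizing is attempted. The only property it aims to show is that keys rejected by the oracles are stored in a backup Bloom filter, so the overall structure has no false negatives.

```python
import hashlib
import math


class BloomFilter:
    """Standard Bloom filter over strings, using double hashing."""

    def __init__(self, capacity, fpr):
        capacity = max(1, capacity)
        # Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(8, math.ceil(-capacity * math.log(fpr) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(key))


class EnsembleLearnedBloomFilter:
    """Several oracles as a pre-filter, plus a backup Bloom filter.

    Assumption for illustration: the ensemble accepts a key only when every
    oracle scores it at or above its threshold. Other rules are possible.
    """

    def __init__(self, keys, oracles, thresholds, backup_fpr=0.01):
        self.oracles = list(oracles)
        self.thresholds = list(thresholds)
        # Keys the ensemble would miss go into the backup filter,
        # which guarantees that no inserted key is ever reported absent.
        missed = [k for k in keys if not self._ensemble_accepts(k)]
        self.backup = BloomFilter(len(missed), backup_fpr)
        for k in missed:
            self.backup.add(k)

    def _ensemble_accepts(self, key):
        return all(o(key) >= t for o, t in zip(self.oracles, self.thresholds))

    def __contains__(self, key):
        return self._ensemble_accepts(key) or key in self.backup


# Hypothetical toy oracles standing in for small learned models.
def has_login_token(url):
    return 1.0 if "login" in url else 0.0


def has_odd_tld(url):
    return 1.0 if url.endswith(".xyz") else 0.0


keys = ["login-bank.xyz", "secure-login.xyz", "free-prizes.example"]
elbf = EnsembleLearnedBloomFilter(keys, [has_login_token, has_odd_tld], [0.5, 0.5])
no_false_negatives = all(k in elbf for k in keys)
```

In this toy run the first two keys are accepted by the oracles directly, while `free-prizes.example` is rejected by both and is therefore recovered through the backup filter. Queries for items outside the key set can still return false positives, either from the oracles or from the backup filter.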