Partitioned-Learned Count-Min Sketch

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: count-min sketch, heavy hitters, frequent items, learning augmented algorithms, streaming algorithms
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: We propose the Partitioned Learned Count-Min Sketch (PL-CMS), a new approach to learning-augmented frequent-item identification in data streams. Our method builds on the learned Count-Min Sketch (LCMS) algorithm of Hsu et al. (ICLR 2019), which combines a standard Count-Min Sketch frequency-estimation data structure with a learned model by partitioning the items in the input stream into two sets: items with sufficiently high predicted frequencies have their frequencies tracked exactly, while the remaining items, with low predicted frequencies, are placed into the Count-Min Sketch. Inspired by an approach of Vaidya et al. (ICLR 2021) for learning-augmented Bloom filters, our PL-CMS algorithm partitions items into multiple sets based on several predicted-frequency thresholds, and each set is handled by a separate Count-Min Sketch data structure. Unlike classic LCMS, this allows the algorithm to take advantage of the full prediction space of the learned model. We demonstrate that, given fixed partitioning thresholds, the parameters of our data structure can be efficiently optimized with a convex program. Empirically, we show that on a variety of benchmarks, PL-CMS achieves a lower false positive rate for frequent-item identification than LCMS and the standard Count-Min Sketch.
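The abstract above describes the full PL-CMS mechanism in prose. For concreteness, the following is a minimal Python sketch of that partitioning scheme, not the authors' implementation: the predictor `predict_freq`, the thresholds, and the per-partition widths are illustrative placeholders, and the convex program the paper uses to optimize the data-structure parameters is not reproduced here.

```python
import bisect
import hashlib

class CountMinSketch:
    """Standard Count-Min Sketch: `depth` rows of `width` counters,
    one hash function per row; estimates are the row-wise minimum."""
    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive a per-row hash by salting the item with the row number.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(h, "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

class PLCMS:
    """Illustrative partitioned learned CMS: items whose predicted frequency
    exceeds the top threshold are counted exactly; the rest are routed by
    predicted-frequency bucket to one of several Count-Min Sketches."""
    def __init__(self, predict_freq, thresholds, widths, depth=4):
        assert len(widths) == len(thresholds)  # one sketch per lower bucket
        self.predict = predict_freq            # learned model, assumed given
        self.thresholds = sorted(thresholds)
        self.exact = {}                        # exact counts, top partition
        self.sketches = [CountMinSketch(w, depth) for w in widths]

    def _bucket(self, item):
        # Bucket 0..k-1 for the k threshold intervals; k means "above all".
        return bisect.bisect_right(self.thresholds, self.predict(item))

    def add(self, item, count=1):
        b = self._bucket(item)
        if b == len(self.thresholds):
            self.exact[item] = self.exact.get(item, 0) + count
        else:
            self.sketches[b].add(item, count)

    def estimate(self, item):
        b = self._bucket(item)
        if b == len(self.thresholds):
            return self.exact.get(item, 0)
        return self.sketches[b].estimate(item)

# Toy usage with a stand-in predictor (hypothetical, not a learned model):
plcms = PLCMS(predict_freq=lambda x: len(str(x)),
              thresholds=[2, 5], widths=[2048, 512])
plcms.add("apple"); plcms.add("apple")
print(plcms.estimate("apple"))  # tracked exactly -> 2
```

In this sketch, routing by predicted-frequency bucket lets each partition's Count-Min Sketch be sized independently; presumably these per-partition dimensions are among the parameters that the paper's convex program optimizes once the thresholds are fixed.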
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2979