Analysis of a Category of Probabilistic Cardinality Estimation Algorithms

XJTU 2024 CSUC Submission16 Authors

01 Apr 2024 (modified: 03 Apr 2024), XJTU 2024 CSUC Submission, CC BY 4.0
Keywords: Cardinality estimation, streaming algorithms, data sketch, distinct elements problem
TL;DR: We show that the hash value distribution does not affect the minimum achievable variance of extreme value-based counters, and validate this by introducing Pareto sketching, which matches the precision of established exponential sketching methods.
Abstract: Accurately estimating the number of unique elements in voluminous data streams remains a critical task in data analytics. The pioneering Flajolet-Martin algorithm and its descendants, such as HyperLogLog, have shaped the field of probabilistic counting techniques. However, there has been ongoing discussion about how the distribution of hash function values affects the performance of these algorithms. This study disputes the widely held belief that the accuracy of cardinality estimation algorithms depends heavily on the distribution of hash values. We demonstrate that, for a broad spectrum of estimators, the minimum possible variance, as dictated by the Cramér-Rao lower bound, is unaffected by the choice of hash value distribution in extreme value-based counters. To validate our theoretical assertions, we present a novel sketching method called Pareto sketching. Our empirical tests show that this method delivers precision on par with established exponential sketching methods. Our work not only simplifies the design of future sketching algorithms but also opens new directions for research in cardinality estimation that are not constrained by distributional choices.
Submission Number: 16
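
The claim in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's algorithm: it assumes a minimum-based counter with m independent registers, simulates hash values with random draws (one draw per distinct element per register), and uses a Pareto distribution with scale 1 and shape 1 for the "Pareto" variant. All function names and parameters are hypothetical and chosen for illustration. The idea is that the minimum of n Exp(1) values is Exp(n), and the logarithm of the minimum of n such Pareto values is also Exp(n), so the same estimator applies after a monotone transform.

```python
import math
import random
import statistics


def min_sketch(n, m, draw):
    """Return m registers; each holds the minimum of n simulated hash values."""
    return [min(draw() for _ in range(n)) for _ in range(m)]


def estimate_exponential(registers):
    # Each register is Exp(n), so the sum is Gamma(m, rate n)
    # and (m - 1) / sum is an unbiased estimate of n.
    return (len(registers) - 1) / sum(registers)


def estimate_pareto(registers):
    # Assumption: hash values follow Pareto(scale=1, shape=1), so
    # log(register) is Exp(n) and the same estimator applies to the logs.
    return (len(registers) - 1) / sum(math.log(r) for r in registers)


def relative_rmse(n, m, trials, draw, estimate, seed=0):
    """Root mean squared relative error of an estimator over repeated sketches."""
    random.seed(seed)
    sq_errors = [
        (estimate(min_sketch(n, m, draw)) / n - 1.0) ** 2 for _ in range(trials)
    ]
    return math.sqrt(statistics.fmean(sq_errors))


def draw_exponential():
    return random.expovariate(1.0)


def draw_pareto():
    return random.paretovariate(1.0)


if __name__ == "__main__":
    n, m, trials = 200, 64, 50
    print("exponential:", relative_rmse(n, m, trials, draw_exponential, estimate_exponential))
    print("pareto:     ", relative_rmse(n, m, trials, draw_pareto, estimate_pareto))
```

Under these assumptions both variants should report a relative error close to 1/sqrt(m - 2), since the transformed registers have the same distribution in both cases. Whether the paper's Pareto sketching uses minima, maxima, or a different estimator is not stated in the abstract, so this example only illustrates the general invariance argument.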