Keywords: Cardinality estimation, streaming algorithms, data sketch, distinct elements problem
TL;DR: We show that the hash value distribution does not affect the minimum achievable variance of extreme value-based counters, and we validate this with Pareto sketching, which matches the precision of established exponential sketching methods.
Abstract: Accurately estimating the number of unique elements
within voluminous data streams remains a critical task in
data analytics. The pioneering Flajolet-Martin algorithm and its
descendants, such as HyperLogLog, have shaped the field
of probabilistic counting techniques. However, there has been
ongoing discussion regarding the impact of hash function value
distribution on the performance of these algorithms. This study
disputes the widely held belief that the accuracy of cardinality
estimation algorithms is highly dependent on the distribution
of hash values. We demonstrate that, for a broad spectrum of
estimators, the minimum possible variance, as dictated by the
Cramér-Rao lower bound, is actually unaffected by the choice
of hash value distribution in extreme value-based counters. To
validate our theoretical assertions, we present a novel sketching
method called Pareto sketching. Our empirical tests show that
this method delivers precision on par with the established
exponential sketching methods. Our work not only simplifies
the design of future sketching algorithms but also opens new
directions for research in cardinality estimation that are not
constrained by distributional choices.
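
The invariance claim can be illustrated with a small k-minimum-value sketch. The code below is a minimal illustration under stated assumptions, not the authors' Pareto sketching construction: the function names, the SHA-256-based hashing, the register count `k`, and the Pareto parameters (scale 1, shape `alpha`) are all hypothetical choices. It relies only on the standard facts that the minimum of n i.i.d. Exp(1) values is Exp(n), and that the logarithm of a Pareto(alpha) value with scale 1 is Exp(alpha). Because the Pareto value is a monotone transform of the exponential one, both estimators here recover the same estimate from the same hashes.

```python
import hashlib
import math


def uniform_hash(item, register):
    """Deterministically map (item, register) to a float strictly inside (0, 1)."""
    digest = hashlib.sha256(f"{register}:{item}".encode()).digest()
    value = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (value + 0.5) / 2**52


def exponential_value(u):
    """Inverse-transform sample from Exp(1)."""
    return -math.log(u)


def pareto_value(u, alpha):
    """Inverse-transform sample from Pareto(shape=alpha, scale=1) (assumed parameters)."""
    return u ** (-1.0 / alpha)


def build_sketch(stream, k, transform):
    """Keep, per register, the minimum transformed hash value seen so far."""
    registers = [math.inf] * k
    for item in stream:
        for j in range(k):
            registers[j] = min(registers[j], transform(uniform_hash(item, j)))
    return registers


def estimate_exponential(registers):
    """Each register is Exp(n), so the sum is Gamma(k, n) and (k-1)/sum is unbiased."""
    return (len(registers) - 1) / sum(registers)


def estimate_pareto(registers, alpha):
    """log(register) is Exp(n * alpha); rescale by alpha and reuse the same estimator."""
    return (len(registers) - 1) / (alpha * sum(math.log(x) for x in registers))


if __name__ == "__main__":
    stream = [f"user-{i % 500}" for i in range(2000)]  # 500 distinct items, with repeats
    k, alpha = 128, 2.0
    exp_sketch = build_sketch(stream, k, exponential_value)
    par_sketch = build_sketch(stream, k, lambda u: pareto_value(u, alpha))
    print("exponential estimate:", estimate_exponential(exp_sketch))
    print("pareto estimate:     ", estimate_pareto(par_sketch, alpha))
```

Repeated items hash to the same values, so duplicates leave the registers unchanged. With k registers, the relative standard error of this estimator is roughly 1/sqrt(k - 2), regardless of which of the two distributions is used.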
Submission Number: 16