On Fine-Grained Distinct Element Estimation

Ilias Diakonikolas; Daniel Kane; Jasper C.H. Lee; Thanasis Pittas; David Woodruff; Samson Zhou

Published: 01 May 2025, Last Modified: 23 Jul 2025, ICML 2025 poster, CC BY 4.0

Abstract: We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $O\left(\alpha\log n\log\log n+\frac{\sqrt{\beta}}{\varepsilon^2} \log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.
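To make the collision parameter concrete, here is a minimal centralized sketch that computes the exact number of distinct elements and a pairwise collision count for a toy input. It is not the paper's protocol, which is designed to avoid moving all the data to one place. The function name `distinct_and_collisions` is hypothetical, and the counting convention (an element held by $k$ servers contributes $\binom{k}{2}$ collisions, with duplicates within a single server ignored) is an assumption about how "pairwise collisions" is defined.

```python
from collections import Counter
from itertools import chain
from math import comb


def distinct_and_collisions(servers):
    """Return (number of distinct elements, pairwise collision count).

    `servers` is a list of iterables, one per server.
    Assumption: an element present on k servers contributes C(k, 2)
    pairwise collisions; repeats within one server are ignored.
    """
    # Deduplicate within each server so only cross-server overlap is counted.
    sets = [set(s) for s in servers]
    # For every element, count how many servers hold it.
    holders = Counter(chain.from_iterable(sets))
    distinct = len(holders)
    collisions = sum(comb(k, 2) for k in holders.values())
    return distinct, collisions


# Toy input: element 3 is on three servers, element 4 on two.
toy = [[1, 2, 3], [3, 4], [3, 4, 5]]
print(distinct_and_collisions(toy))
```

With fully disjoint inputs the collision count is zero, which corresponds to the small-$C$ regime where the abstract states that the new protocol beats the previous worst-case bounds.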

Lay Summary: When data is spread across multiple servers, it's often important to estimate how many unique items there are without moving all the data to one place. Previous methods for this task focused on worst-case scenarios and required a lot of communication between servers, which can be inefficient and impractical. We develop a new approach that looks at how often the same item appears on multiple servers—a quantity we call "collisions"—and uses this to guide the communication strategy. This leads to much lower communication in typical cases, making the process faster and more scalable. Our results also help explain why these estimation tasks are easier in practice than worst-case theory suggests.

Primary Area: Theory->Everything Else

Keywords: distinct elements, distributed model, communication complexity, data streams

Submission Number: 3498
