Learning-based Support Estimation in Sublinear Time

Talya Eden; Piotr Indyk; Shyam Narayanan; Ronitt Rubinfeld; Sandeep Silwal; Tal Wagner

Learning-based Support Estimation in Sublinear Time

Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner

Published: 12 Jan 2021, Last Modified: 05 May 2023ICLR 2021 SpotlightReaders: Everyone

Keywords: support estimation, sublinear, learning-based, distinct elements, chebyshev polynomial

Abstract: We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $ \pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to $$ \ \log (1/\varepsilon) \cdot n^{1-\Theta(1/\log(1/\varepsilon))}. $$ We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from {Hsu et al, ICLR'19} as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state of the art algorithm.

One-sentence Summary: A learning-based algorithm for support size estimation, which given a sufficiently accurate predictor, improves both provably and empirically over state-of-the-art algorithms that do not use a predictor.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Supplementary Material: zip

16 Replies

Loading