Efficient Stream Sampling for Variance-Optimal Estimation of Subset Sums

2011 (modified: 13 May 2023), SIAM J. Comput. 2011, Readers: Everyone
Abstract: From a high-volume stream of weighted items, we want to maintain a generic sample of a certain limited size $k$ that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, $\textsc{VarOpt}_k$, that dominates all previous schemes in terms of estimation quality. $\textsc{VarOpt}_k$ provides variance-optimal unbiased estimation of subset sums. More precisely, if we have seen $n$ items of the stream, then for any subset size $m$, our scheme based on $k$ samples minimizes the average variance over all subsets of size $m$. In fact, the optimality is against any off-line scheme with $k$ samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in $O(\log k)$ time. Finally, it is particularly well suited for combinations of samples from different streams in a distributed setting.
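To make the setting concrete, the following is a minimal Python sketch of a threshold-based weighted reservoir of size $k$ in the spirit of the scheme described above. It is not the authors' implementation: the function names `varopt_sample` and `estimate_subset_sum` are hypothetical, and the sketch assumes that each new item is handled by computing a threshold `tau` over the $k+1$ candidates, dropping one below-threshold item with probability $1 - w/\tau$, and raising the surviving below-threshold items to adjusted weight `tau`. For simplicity it re-sorts the reservoir on every arrival, so it costs $O(k \log k)$ per item rather than the $O(\log k)$ stated in the abstract.

```python
import random


def varopt_sample(stream, k, rng=None):
    """Keep at most k (item, adjusted_weight) pairs from (item, weight) pairs."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = rng or random.Random()
    reservoir = []  # list of (item, adjusted weight)
    for item, weight in stream:
        if weight <= 0:
            raise ValueError("weights must be positive")
        reservoir.append((item, float(weight)))
        if len(reservoir) <= k:
            continue  # still filling the reservoir, nothing to drop

        # k + 1 candidates: order by adjusted weight, largest first.
        reservoir.sort(key=lambda pair: pair[1], reverse=True)

        # Find tau with sum(min(1, w / tau)) == k. The first `large`
        # items exceed tau and are kept with certainty.
        tail = sum(w for _, w in reservoir)
        large = 0
        while True:
            tau = tail / (k - large)
            # The large == k - 1 guard only protects against rounding.
            if large == k - 1 or reservoir[large][1] <= tau:
                break
            tail -= reservoir[large][1]
            large += 1

        # Drop exactly one small item; the drop probabilities
        # 1 - w / tau over the small items sum to 1.
        u = rng.random()
        acc = 0.0
        drop = len(reservoir) - 1  # fallback if rounding leaves acc < u
        for i in range(large, len(reservoir)):
            acc += 1.0 - reservoir[i][1] / tau
            if u < acc:
                drop = i
                break
        del reservoir[drop]

        # Large items keep their weight; surviving small items get tau.
        reservoir = [
            (x, w if j < large else tau) for j, (x, w) in enumerate(reservoir)
        ]
    return reservoir


def estimate_subset_sum(sample, predicate):
    """Estimate a subset's total weight by summing adjusted weights in it."""
    return sum(w for x, w in sample if predicate(x))
```

In this sketch the adjusted weights in the reservoir always add up to the total weight seen so far (up to floating-point rounding), so `estimate_subset_sum(sample, lambda x: True)` recovers the stream total, while estimates for proper subsets vary with the random drops.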