Abstract: Key value data sets of the form {(x, w<sub>x</sub>)} where w<sub>x</sub> > 0 are prevalent. Common queries over such data are segment f-statistics Q(f, H) = Σ<sub>x∈H</sub> f(w<sub>x</sub>), specified for a segment H of the keys and a function f. Different choices of f correspond to count, sum, moments, capping, and threshold statistics. When the data set is large, we can compute a smaller sample from which we can quickly estimate statistics. A weighted sample of keys taken with respect to f(w<sub>x</sub>) provides estimates with statistically guaranteed quality for f-statistics. Such a sample S<sup>(f)</sup> can be used to estimate g-statistics for g ≠ f, but quality degrades with the disparity between g and f. In this paper we address applications that require quality estimates for a set F of different functions. A naive solution is to compute and work with a different sample S<sup>(f)</sup> for each f ∈ F. Instead, this can be achieved more effectively and seamlessly using a single multi-objective sample S<sup>(F)</sup> of a much smaller size. We review multi-objective sampling schemes and place them in our context of estimating f-statistics. We show that a multi-objective sample for F provides quality estimates for any f that is a positive linear combination of functions from F.
We then establish a surprising and powerful result when the target set M is all monotone non-decreasing functions, noting that M includes most natural statistics. We provide efficient multi-objective sampling algorithms for M and show that a sample size of k ln n (where n is the number of active keys) provides the same estimation quality, for any f ∈ M, as a dedicated weighted sample of size k for f.
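The idea of a single multi-objective sample for a finite set F can be illustrated with a minimal sketch. It assumes Poisson sampling with probability proportional to size, where each key is included with the maximum of its per-objective inclusion probabilities, and uses inverse-probability estimates of Q(f, H). The function names (`pps_probabilities`, `multi_objective_sample`, `estimate`) and the choice of scheme are illustrative assumptions, not necessarily the algorithms developed in the paper.

```python
import random

def pps_probabilities(data, f, k):
    """Per-key inclusion probabilities min(1, k * f(w_x) / sum_y f(w_y)).

    Assumption: the total of f over all keys is positive.
    """
    total = sum(f(w) for w in data.values())
    return {x: min(1.0, k * f(w) / total) for x, w in data.items()}

def multi_objective_sample(data, objectives, k, seed=0):
    """Include each key independently with the maximum of its
    per-objective probabilities; store the weight and that probability."""
    rng = random.Random(seed)
    per_f = [pps_probabilities(data, f, k) for f in objectives]
    sample = {}
    for x, w in data.items():
        p = max(probs[x] for probs in per_f)
        if rng.random() < p:  # random() is in [0, 1), so p == 1 always includes
            sample[x] = (w, p)
    return sample

def estimate(sample, f, segment):
    """Inverse-probability estimate of Q(f, H) = sum of f(w_x) over x in H."""
    return sum(f(w) / p for x, (w, p) in sample.items() if segment(x))

# Hypothetical data and objectives: count, sum, and capping at 10.
data = {x: float(x + 1) for x in range(100)}
objectives = [lambda w: 1.0, lambda w: w, lambda w: min(w, 10.0)]

sample = multi_objective_sample(data, objectives, k=20)
even_keys = lambda x: x % 2 == 0  # the segment H
approx_sum = estimate(sample, lambda w: w, even_keys)
```

Because each key's inclusion probability is at least its probability under any single objective in F, the estimate for each f ∈ F has variance no larger than under a dedicated Poisson sample for f, while the expected sample size is at most k times the number of objectives.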