Testing Closeness of Multivariate Distributions via Ramsey Theory

Published: 01 Jan 2024 · Last Modified: 11 Jan 2025 · STOC 2024 · License: CC BY-SA 4.0
Abstract: We investigate the statistical task of closeness (or equivalence) testing for multidimensional distributions. Specifically, given sample access to two unknown distributions $p, q$ on $\mathbb{R}^d$, we want to distinguish between the case that $p = q$ versus $\|p - q\|_{\mathcal{A}_k} > \epsilon$, where $\|p - q\|_{\mathcal{A}_k}$ denotes the generalized $\mathcal{A}_k$ distance between $p$ and $q$, measuring the maximum discrepancy between the distributions over any collection of $k$ disjoint, axis-aligned rectangles. Our main result is the first closeness tester for this problem with sub-learning sample complexity in any fixed dimension, together with a nearly-matching sample complexity lower bound. In more detail, we provide a computationally efficient closeness tester with sample complexity $O\big((k^{6/7}/\mathrm{poly}_d(\epsilon)) \log^d(k)\big)$. On the lower bound side, we establish a qualitatively matching sample complexity lower bound of $\Omega(k^{6/7}/\mathrm{poly}(\epsilon))$, even for $d = 2$. These sample complexity bounds are surprising because the sample complexity of the problem in the univariate setting is $\Theta(k^{4/5}/\mathrm{poly}(\epsilon))$. This has the interesting consequence that the jump from one to two dimensions leads to a substantial increase in sample complexity, while increases beyond that do not. As a corollary of our general $\mathcal{A}_k$ tester, we obtain $d_{\mathrm{TV}}$-closeness testers for pairs of $k$-histograms on $\mathbb{R}^d$ over a common unknown partition, and for pairs of uniform distributions supported on the union of $k$ unknown disjoint axis-aligned rectangles. Both our algorithm and our lower bound make essential use of tools from Ramsey theory.
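For concreteness, the generalized $\mathcal{A}_k$ distance described in the abstract can be written as follows. This is a standard formulation consistent with the abstract's description (maximum discrepancy over collections of $k$ disjoint, axis-aligned rectangles); the paper's exact convention may differ in minor details:

$$\|p - q\|_{\mathcal{A}_k} \;=\; \sup\Big\{\, \big|\,p(S) - q(S)\,\big| \;:\; S = \textstyle\bigcup_{i=1}^{k} R_i,\ \ R_1, \dots, R_k \subseteq \mathbb{R}^d \text{ disjoint axis-aligned rectangles} \,\Big\}.$$

For $d = 1$ this recovers the classical $\mathcal{A}_k$ distance over unions of $k$ disjoint intervals, for which the univariate testing rate $\Theta(k^{4/5}/\mathrm{poly}(\epsilon))$ quoted above is known.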