Testing with Non-identically Distributed Samples

TMLR Paper5180 Authors

23 Jun 2025 (modified: 26 Jun 2025) · Under review for TMLR · CC BY 4.0
Abstract: We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $p_1, p_2,\ldots,p_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $p_{\mathrm{avg}}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities: individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $p_{\mathrm{avg}}$ to within error $\varepsilon$ in $\ell_1$ distance. For uniformity or identity testing --- distinguishing the case that $p_{\mathrm{avg}}$ equals some reference distribution from the case that it has $\ell_1$ distance at least $\varepsilon$ from the reference distribution --- we show that a number of samples linear in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing guarantees of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ total samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho > 0$ such that even in the linear regime with $\rho k$ samples, no tester that considers only the multiset of samples (ignoring which samples were drawn from the same $p_i$) can perform uniformity testing.
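The sampling model above is easy to simulate. The following sketch (all names and the specific family of sources are illustrative choices, not from the paper) builds $T$ heterogeneous distributions whose average is uniform, draws $c=1$ sample from each, and checks that the pooled empirical distribution learns $p_{\mathrm{avg}}$ in $\ell_1$, consistent with the $\Theta(k/\varepsilon^2)$ learning guarantee:

```python
import random

def sample_avg_distribution(dists, c, rng):
    """Draw c independent samples from each source distribution
    (a list of probability vectors over {0, ..., k-1}) and return
    the pooled empirical distribution."""
    k = len(dists[0])
    counts = [0] * k
    n = 0
    for p in dists:
        for _ in range(c):
            counts[rng.choices(range(k), weights=p)[0]] += 1
            n += 1
    return [x / n for x in counts]

def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

rng = random.Random(0)
k, T, c = 10, 2000, 1
# Heterogeneous sources: each p_i puts half its mass on one symbol,
# cycling over the support, so no individual p_i is close to uniform.
dists = []
for i in range(T):
    p = [1.0 / (2 * k)] * k
    p[i % k] += 0.5
    dists.append(p)
# Because the heavy symbol cycles evenly over all k symbols (k divides T),
# the average distribution p_avg is exactly uniform.
p_avg = [1.0 / k] * k
p_hat = sample_avg_distribution(dists, c, rng)
err = l1_distance(p_hat, p_avg)  # small for n = cT = 2000 samples
```

Note that `p_hat` concentrates around the uniform `p_avg` even though every individual source is far from uniform; this is exactly the sense in which learning (and, for $c \ge 2$, testing) properties of $p_{\mathrm{avg}}$ is meaningful here.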
We further extend our techniques to the problem of testing "closeness" of two distributions: given $c=3$ independent draws from each of $p_1, p_2,\ldots,p_T$ and $q_1, q_2,\ldots,q_T$, one can distinguish the case that $p_{\mathrm{avg}}=q_{\mathrm{avg}}$ from the case that they have $\ell_1$ distance at least $\varepsilon$ using $O(k^{2/3}/\varepsilon^{8/3})$ total samples, where $k$ is an upper bound on the support size, matching the optimal sample complexity of the i.i.d. setting up to the $\varepsilon$-dependence.
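The closeness setting can be simulated the same way. The sketch below (an illustrative setup only; it pools samples and compares empirical distributions as a naive linear-sample baseline, not the paper's sublinear tester) builds two families of sources whose individual members differ, yet whose averages coincide:

```python
import random

def pooled_empirical(dists, c, k, rng):
    """Pool c draws from each source distribution into one empirical pmf."""
    counts = [0] * k
    for p in dists:
        for _ in range(c):
            counts[rng.choices(range(k), weights=p)[0]] += 1
    n = c * len(dists)
    return [x / n for x in counts]

def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

rng = random.Random(1)
k, T, c = 8, 1600, 3
# Two families of heterogeneous sources. Each p_i and q_i puts half its
# mass on one symbol; q_i's heavy symbol is shifted relative to p_i's,
# so p_i != q_i for every i, yet the cyclic averages coincide
# (both p_avg and q_avg are uniform, since k divides T).
ps, qs = [], []
for i in range(T):
    p = [1.0 / (2 * k)] * k
    p[i % k] += 0.5
    ps.append(p)
    q = [1.0 / (2 * k)] * k
    q[(i + 3) % k] += 0.5
    qs.append(q)
p_hat = pooled_empirical(ps, c, k, rng)
q_hat = pooled_empirical(qs, c, k, rng)
dist = l1_distance(p_hat, q_hat)  # small, since p_avg = q_avg
```

A tester that only looked at pairs $(p_i, q_i)$ individually would declare every pair far apart; the quantity of interest is the distance between the averages, which is zero here.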
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Jasper_C.H._Lee1
Submission Number: 5180