Efficient Statistics With Unknown Truncation, Polynomial Time Algorithms, Beyond Gaussians

Jane H. Lee, Anay Mehrotra, Manolis Zampetakis

Published: 31 Oct 2024, Last Modified: 16 Oct 2024, Venue: FOCS 2024, License: CC0 1.0

Abstract: We study the estimation of distributional parameters when samples are shown only if they fall in some unknown set S ⊆ R^d. Kontonis, Tzamos, and Zampetakis (FOCS’19) gave a d^{poly(1/ε)} time algorithm for finding ε-accurate parameters for the special case of Gaussian distributions with diagonal covariance matrix. Recently, Diakonikolas, Kane, Pittas, and Zarifis (COLT’24) showed that this exponential dependence on 1/ε is necessary even when S belongs to some well-behaved classes. These works leave the following open problems which we address in this work: Can we estimate the parameters of any Gaussian or even extend the results beyond Gaussians? Can we design poly(d/ε) time algorithms when S is a simple set such as a halfspace? We make progress on both of these questions by providing the following results:

1. Toward the first question, we provide an estimation algorithm with sample and time complexity d^{poly(ℓ/ε)} for any exponential family that satisfies some structural assumptions and any unknown set S that is ε-approximable by a degree-ℓ polynomial. This result has two important applications:
   (a) The first algorithm for estimating arbitrary Gaussian distributions (even with non-diagonal covariance matrix) from samples truncated to an unknown set S; and
   (b) The first algorithm for linear regression with unknown truncation and Gaussian features.
2. To address the second question, we provide an algorithm with poly(d/ε) sample and time complexity that works for a set of exponential families (that contains all multivariate Gaussians) when S is a halfspace or an axis-aligned rectangle. This is the first fully polynomial time algorithm for estimation with an unknown truncation set.

Along the way, we develop new tools that may be of independent interest, including:

3. A reduction from PAC learning with positive and unlabeled samples to PAC learning with positive and negative samples that is robust to certain covariate shifts; and
4. The first polynomial time algorithm for learning halfspaces using only positive examples when the samples have an unknown Gaussian distribution.
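
To make the truncation setting concrete, the following is a minimal sketch of the sampling model only, not of the paper's algorithms. It assumes a one-dimensional standard Gaussian and a halfspace S = {x : x ≥ 0.5}; the function name, the threshold, and the sample size are hypothetical choices made for illustration. It shows why estimation is nontrivial: the naive empirical mean of the observed samples concentrates around the truncated mean rather than the true mean 0.

```python
import random
from statistics import NormalDist, fmean


def sample_truncated_gaussian(n, mu, sigma, threshold, seed=0):
    """Draw from N(mu, sigma^2) but reveal a sample only if it lies in
    S = {x : x >= threshold} (a 1-D halfspace). The estimator in the
    truncated-statistics setting would not be told `threshold`."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        x = rng.gauss(mu, sigma)
        if x >= threshold:  # samples outside S are never observed
            kept.append(x)
    return kept


# Hypothetical parameters chosen for illustration.
true_mu, true_sigma, threshold = 0.0, 1.0, 0.5
samples = sample_truncated_gaussian(20000, true_mu, true_sigma, threshold)

# Naive estimate that ignores truncation.
naive_mean = fmean(samples)

# Closed-form mean of a standard Gaussian conditioned on x >= threshold:
# E[X | X >= a] = pdf(a) / (1 - cdf(a)).
std_normal = NormalDist()
truncated_mean = std_normal.pdf(threshold) / (1.0 - std_normal.cdf(threshold))

print(f"true mean: {true_mu}")
print(f"naive mean of observed samples: {naive_mean:.3f}")
print(f"mean of the truncated distribution: {truncated_mean:.3f}")
```

The gap between the naive estimate and the true mean depends on the set S, which is why an estimator that does not know S has to account for it in some way.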