Minimax Posterior Contraction Rates for Unconstrained Distribution Estimation on $[0,1]^d$ under Wasserstein Distance

TMLR Paper 3500 Authors

15 Oct 2024 (modified: 27 Nov 2024) · Under review for TMLR · CC BY 4.0
Abstract: We obtain asymptotic minimax optimal posterior contraction rates for estimation of probability distributions on $[0,1]^d$ under the Wasserstein-$p$ metrics using Bayesian Histograms. To the best of our knowledge, our analysis is the first to provide minimax posterior contraction rates for every $p \geq 1$ and problem dimension $d \geq 1$. Our proof technique takes advantage of the conjugacy of the Bayesian Histogram.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We thank the reviewers for the time and effort they have spent reading our paper and providing feedback. This revision announcement will be followed by our responses to each reviewer. Here we first describe the most significant improvements included in the revision.

1. In the introduction, we now provide a more complete summary of the prior works in PCR theory that have laid the foundation for our work. In particular, we reference and briefly describe three successful applications of the Ghosal framework for proving PCRs for distribution estimation (Scricciolo 2007; Kruijer and van der Vaart 2008; Shen et al. 2013); see page 2. We also give a fuller and more appreciative discussion of the works cited in the original submission that specifically study Wasserstein PCRs (Chae 2021; Camerlenghi et al. 2022; Rousseau & Scricciolo 2023; Gao & van der Vaart 2016; Scricciolo 2018); see the end of page 2 and the top of page 3 for these updates.

2. We have made the connection between PCRs and minimax rates more explicit. Specifically, our Theorem 1 now proves a stronger PCR; under this stronger characterization of posterior contraction, it is impossible for the proved PCR to decay faster than the minimax rate. This update is associated with the following changes:
   * Section 3 is new. In it we define a notion of posterior contraction (Definition 1), called a "minimax-conscious PCR", that in general cannot decay faster than the minimax rate; this is proved in the new Lemma 1. We then connect the minimax-conscious PCR to the more traditional notion of a PCR that holds almost surely: Definition 8.1 of the canonical text Fundamentals of Nonparametric Bayesian Inference (Ghosal & van der Vaart 2017) gives the standard strong-sense (almost surely convergent) PCR. In our new Lemma 2, we show that a minimax-conscious PCR implies this strong-sense PCR at the same rate. Recall that in the original manuscript, Theorem 1 only proved the weaker in-probability PCR.
   * Because the new Theorem 1 is stronger than the original Theorem 1, its proof involves an additional auxiliary lemma. Specifically, the old auxiliary Lemmas 1 and 2 are now labeled Lemma 3 and Lemma 5. The new auxiliary Lemma 4 provides an exponentially decaying upper bound on the concentration of the $p^{th}$-powered Wasserstein-$p$ distance between the true distribution and the histogram estimator around its mean, using McDiarmid's inequality.

3. We have newly included Section 5, which provides concise instructions to a practitioner on how to use the prior distribution to represent distributional beliefs before collecting data. While we believe the main contribution of our work is advancing posterior contraction theory to minimax optimal PCRs under Wasserstein distances in arbitrary dimension, reviewer iaVK importantly asks: what are the possible benefits of the posterior mean histogram over the empirical measure? As discussed further in our response to iaVK, one possible benefit is the posterior mean histogram's ability to incorporate prior knowledge, which may be useful when the sample size is small and correct prior knowledge is provided. Theorem 1 can thus be viewed as a robustness analysis: it provides the practitioner with a guarantee that the worst-case expected loss will match that of the best-performing purely frequentist procedure when the sample size is large, even if incorrect prior knowledge is provided. Section 5 provides a guide for the practitioner to follow that allows for incorporation of prior knowledge without sacrificing the large-sample competitiveness of the Bayesian histogram with the best-performing frequentist procedure.

4. Other improvements: see the individual responses to each reviewer for more information on additional minor changes.
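The conjugacy that the abstract and point 3 rely on can be sketched in a few lines. Below is a minimal illustration (for $d = 1$) of a Bayesian histogram posterior mean under an assumed symmetric Dirichlet prior on the bin probabilities, together with a Wasserstein-1 comparison on a common grid; the bin count `m`, the concentration `alpha`, and all function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def posterior_mean_histogram(x, m, alpha=1.0):
    """Posterior mean bin probabilities for a Bayesian histogram on [0, 1]
    with m equal-width bins and a symmetric Dirichlet(alpha) prior.
    By Dirichlet-multinomial conjugacy the posterior is again Dirichlet,
    with mean (count_j + alpha) / (n + m * alpha) for bin j."""
    n = len(x)
    counts, _ = np.histogram(x, bins=m, range=(0.0, 1.0))
    return (counts + alpha) / (n + m * alpha)

def wasserstein1_on_grid(p, q):
    """Wasserstein-1 distance between two bin-probability vectors on the
    same equal-width grid, treating each as placing its bin mass at the
    bin centers: W_1 = (1/m) * sum_j |F_p(j) - F_q(j)|."""
    m = len(p)
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) / m

# Posterior mean vs. empirical bin frequencies on a small sample: the
# Dirichlet prior pulls the posterior mean toward the uniform histogram.
x = np.array([0.1, 0.6, 0.9])
post = posterior_mean_histogram(x, m=4, alpha=1.0)
emp = np.histogram(x, bins=4, range=(0.0, 1.0))[0] / len(x)
gap = wasserstein1_on_grid(post, emp)
```

With a small sample the two estimators differ (here `gap > 0`), while as $n \to \infty$ the prior's contribution $\alpha/(n + m\alpha)$ vanishes and the posterior mean approaches the empirical frequencies, which is the robustness behavior Theorem 1 formalizes.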
Assigned Action Editor: ~Alp_Kucukelbir1
Submission Number: 3500