Optimal Estimation of the Null Distribution in Large-Scale Inference

Published: 01 Jan 2025, Last Modified: 14 May 2025. IEEE Trans. Inf. Theory 2025. License: CC BY-SA 4.0
Abstract: The advent of large-scale inference has spurred reexamination of conventional statistical thinking. In a series of highly original articles, Efron persuasively illustrated the danger for downstream inference in assuming the veracity of a posited null distribution. In a Gaussian model for $n$ z-scores with at most $k < \frac{n}{2}$ nonnulls, Efron suggests estimating the parameters of an empirical null $N(\theta, \sigma^2)$ instead of assuming the theoretical null $N(0, 1)$. Looking to the robust statistics literature by viewing the nonnulls as outliers is unsatisfactory, as the question of optimal rates is still open; even consistency is not known in the regime $k \asymp n$, which is especially relevant to many large-scale inference applications. However, provably rate-optimal robust estimators have been developed in other models (e.g., Huber contamination) that appear quite close to Efron’s proposal. Notably, the impossibility of consistency when $k \asymp n$ in these other models may suggest that the same major weakness afflicts Efron’s popularly adopted recommendation. A sound evaluation thus requires a complete understanding of the information-theoretic limits. We characterize the regime of $k$ for which consistent estimation is possible, notably without imposing any assumptions at all on the nonnull effects. Unlike in other robust models, it is shown that consistent estimation of the location parameter is possible if and only if $\frac{n}{2} - k = \omega(\sqrt{n})$, and of the scale parameter in the entire regime $k < \frac{n}{2}$. Furthermore, we establish sharp minimax rates and show that estimators based on the empirical characteristic function are optimal by exploiting the Gaussian character of the data.
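The abstract refers to estimators based on the empirical characteristic function (ECF). The following is a minimal, illustrative Python sketch, not the paper's rate-optimal construction: for $N(\theta, \sigma^2)$ data the characteristic function is $\exp(i t \theta - \sigma^2 t^2 / 2)$, so the modulus of the ECF at a fixed frequency suggests a scale estimate and its phase a location estimate. The choice of frequency `t` and the direct moment-matching below are assumptions made purely for illustration.

```python
import numpy as np

def ecf(z, t):
    """Empirical characteristic function of the z-scores at frequency t."""
    return np.mean(np.exp(1j * t * np.asarray(z)))

def empirical_null_ecf(z, t=1.0):
    """Illustrative ECF-based estimates of an empirical null N(theta, sigma^2).

    For exactly Gaussian data, |ecf| = exp(-sigma^2 t^2 / 2) and
    arg(ecf) = t * theta, so inverting these gives the parameters.
    Nonnull z-scores perturb the ECF, but at a fixed frequency it remains
    informative when the nonnull fraction is small. (Sketch only; the
    frequency t and this inversion are illustrative assumptions.)
    """
    phi = ecf(z, t)
    sigma2_hat = -2.0 * np.log(np.abs(phi)) / t**2
    theta_hat = np.angle(phi) / t
    return theta_hat, sigma2_hat

# Example: 900 null z-scores from N(0.3, 1.2^2) plus 100 nonnull effects.
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0.3, 1.2, 900), rng.normal(4.0, 1.0, 100)])
theta_hat, sigma2_hat = empirical_null_ecf(z)
print(theta_hat, sigma2_hat)
```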