Missing g-Mass: Investigating the Missing Parts of Distributions

Published: 01 Jan 2024, Last Modified: 11 Oct 2024 · IEEE Trans. Inf. Theory 2024 · CC BY-SA 4.0
Abstract: Estimating the underlying distribution from iid samples is a classical and important problem in statistics. When the alphabet size is large compared to the number of samples, a portion of the distribution is highly likely to be unobserved or sparsely observed. The missing mass, defined as the sum of probabilities $\Pr(x)$ over the missing letters $x$, and the Good-Turing estimator for the missing mass have been important tools in large-alphabet distribution estimation. In this article, given a positive function $g$ from $[0,1]$ to the reals, the missing $g$-mass, defined as the sum of $g(\Pr(x))$ over the missing letters $x$, is introduced and studied. The missing $g$-mass can be used to investigate the structure of the missing part of the distribution. Specific applications of special cases such as the order-$\alpha$ missing mass ($g(p)=p^{\alpha}$) and the missing Shannon entropy ($g(p)=-p\log p$) include estimating the distance from uniformity of the missing distribution and partially estimating it. Minimax estimation is studied for the order-$\alpha$ missing mass for integer values of $\alpha$, and exact minimax convergence rates are obtained. Concentration is studied for a class of functions $g$, and specific results are derived for the order-$\alpha$ missing mass and the missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal worst-case variance factors are derived. Two new notions of concentration, named strongly sub-Gamma and filtered sub-Gaussian concentration, are introduced and shown to yield right tail bounds that are better than those obtained from sub-Gaussian concentration.
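
To make the definitions concrete, the following is a minimal simulation sketch (not from the paper): it assumes a Zipf-like ground-truth distribution, draws iid samples, and computes the missing $g$-mass for several choices of $g$, alongside the classical Good-Turing estimate $N_1/n$ of the missing mass, where $N_1$ is the number of letters observed exactly once. The distribution, alphabet size, sample size, and $\alpha$ are illustrative assumptions.

```python
import numpy as np

def missing_g_mass(p, counts, g):
    """Missing g-mass: sum of g(p_x) over letters x unseen in the sample."""
    missing = counts == 0
    return g(p[missing]).sum()

rng = np.random.default_rng(0)
k, n = 10_000, 2_000                 # alphabet size >> sample size (assumed values)
p = 1.0 / np.arange(1, k + 1)        # Zipf-like ground truth (illustrative assumption)
p /= p.sum()

sample = rng.choice(k, size=n, p=p)
counts = np.bincount(sample, minlength=k)

# g(p) = p         -> classical missing mass
# g(p) = p^alpha   -> order-alpha missing mass
# g(p) = -p log p  -> missing Shannon entropy
alpha = 2.0
print("missing mass         :", missing_g_mass(p, counts, lambda q: q))
print("order-2 missing mass :", missing_g_mass(p, counts, lambda q: q**alpha))
print("missing entropy      :", missing_g_mass(p, counts, lambda q: -q * np.log(q)))

# Good-Turing estimate of the classical missing mass: N_1 / n,
# where N_1 is the number of letters seen exactly once.
print("Good-Turing estimate :", (counts == 1).sum() / n)
```

Unlike the classical missing mass, the quantities with $g(p)=p^{\alpha}$ and $g(p)=-p\log p$ are not directly observable from the sample; the sketch above only evaluates them against the known simulated distribution, whereas the paper studies their estimation and concentration.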