Estimating Unknown Population Sizes Using Hypergeometric Maximum Likelihood

23 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: multivariate hypergeometric distribution, maximum likelihood estimation, variational autoencoder, genomics
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: The multivariate hypergeometric distribution describes the fundamental process of sampling without replacement from a discrete population of elements divided into multiple categories. Despite the hypergeometric distribution's long history, the literature has not yet addressed the problem of maximum likelihood estimation when both the total population size and the sizes of its constituent categories are unknown. Here, we show that this estimation challenge can be solved by maximizing the hypergeometric likelihood, even in the presence of severe under-sampling. We extend this approach, using the variational autoencoder framework, to capture data-generating processes in which the ground-truth high-dimensional distribution is conditional on a continuous latent variable, and we validate the resulting model on simulated datasets. In a practical use case, we demonstrate that our method can recover the true number of gene transcripts present in a cell from sparse single-cell genomics data.
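For orientation, the likelihood being maximized can be sketched as follows. This is a minimal illustration of the standard multivariate hypergeometric log-probability, not the submission's implementation; the function names `log_binom`, `mvhg_log_pmf`, and `log_likelihood` are hypothetical, and the candidate category sizes are assumed to be integers.

```python
import math

def log_binom(n, k):
    """Log of the binomial coefficient C(n, k), computed via log-gamma."""
    if k < 0 or k > n:
        return float("-inf")  # impossible draw: more items than the category holds
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def mvhg_log_pmf(counts, category_sizes):
    """Log-probability of observing `counts` when sum(counts) items are drawn
    without replacement from a population with the given `category_sizes`."""
    n_drawn = sum(counts)
    total = sum(category_sizes)
    log_num = sum(log_binom(N_i, x_i) for N_i, x_i in zip(category_sizes, counts))
    return log_num - log_binom(total, n_drawn)

def log_likelihood(observations, category_sizes):
    """Sum of log-probabilities over independent draws (hypothetical helper).
    Maximum likelihood estimation searches for the category sizes that
    maximize this quantity."""
    return sum(mvhg_log_pmf(x, category_sizes) for x in observations)
```

For example, drawing one item from a population with category sizes (2, 1) gives the first category with probability 2/3, so `math.exp(mvhg_log_pmf([1, 0], [2, 1]))` is approximately 0.667.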
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8423