- Keywords: Bayesian nonparametric, genomics, prediction, optimal experimental design
- Abstract: Despite the advent of Big Data, data-gathering in many domains can still be an expensive process that necessitates careful planning. For instance, in genomics, researchers can spend money and time to sequence a greater number of individual genomes -- or alternatively they can spend these resources to sequence individual genomes with increased accuracy. In either case, spending resources has the potential to reveal new variations in the genome and thereby new genetic insights. We consider the case where scientists have already conducted a pilot study to reveal some variants in a genome and are contemplating a follow-up study. We provide a novel Bayesian nonparametric method for predicting how many variants scientists can expect to find in the follow-up, based on the information in the pilot. When sequencing accuracy is kept constant between the pilot and follow-up, we demonstrate on real data from the gnomAD project that our prediction is more accurate than two recent proposals -- and as accurate as a more classic proposal. Unlike existing methods, however, our method allows practitioners to change the sequencing accuracy between the pilot and the follow-up. We demonstrate how this distinction allows our method to be used both for more realistic predictions and for optimal experimental design of the follow-up study under a resource budget.