Keywords: Fitness Prediction, Hierarchical Sampling, Random Selection
TL;DR: Random selection is not the optimal sampling strategy in low-data regimes.
Abstract: Protein fitness prediction models enable sequence design but depend critically on which variants are experimentally measured. Prior work claimed that random sampling of sequences is consistently better than structured ``hierarchical'' sampling, contradicting the intuition that diversity should help in small-data regimes. We show that this claim was driven by insufficient statistical power rather than biology. Re-evaluating sampling strategies on the DHFR fitness landscape with \emph{ten} replicates per condition (vs.\ three in prior work), we find that hierarchical sampling significantly outperforms random sampling when data are scarce: the two-synonymous-amino-acids strategy achieves 5.5\% lower test loss at 200 sequences and 4.2\% lower test loss at 1{,}000 sequences. The advantage disappears only once the training set exceeds $\sim$3{,}000 sequences. We explain this behavior with an information-theoretic model: hierarchical strategies maximize amino acid sequence coverage, increasing mutual information between sampled sequences and fitness labels. Our results overturn previous recommendations, provide concrete guidelines for experimental design under realistic assay budgets, and highlight the importance of replication and power analysis in computational biology.
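The coverage argument in the abstract can be illustrated with a toy sketch. This is not the paper's code or its two-synonymous-amino-acids strategy; it is a minimal, hypothetical greedy "coverage-first" sampler on a tiny synthetic landscape, showing how a structured strategy can cover more (position, amino acid) pairs than random sampling at the same budget:

```python
import random
from itertools import product

# Hypothetical toy landscape: all variants over 3 positions x 4 amino acids.
AA = "ACDE"
variants = ["".join(v) for v in product(AA, repeat=3)]  # 64 variants

def random_sample(pool, n, seed=0):
    """Baseline: uniform random draw of n variants."""
    return random.Random(seed).sample(pool, n)

def hierarchical_sample(pool, n, seed=0):
    """Illustrative structured sampler: greedily pick the variant that
    introduces the most unseen (position, amino acid) pairs."""
    remaining = pool[:]
    random.Random(seed).shuffle(remaining)  # break ties randomly
    seen, chosen = set(), []
    while len(chosen) < n:
        best = max(remaining,
                   key=lambda v: sum((i, a) not in seen
                                     for i, a in enumerate(v)))
        chosen.append(best)
        remaining.remove(best)
        seen.update((i, a) for i, a in enumerate(best))
    return chosen

def coverage(sample):
    """Number of distinct (position, amino acid) pairs observed."""
    return len({(i, a) for v in sample for i, a in enumerate(v)})

# At a small budget, the greedy sampler covers all 12 pairs;
# a random draw of the same size typically covers fewer.
print(coverage(hierarchical_sample(variants, 8)),
      coverage(random_sample(variants, 8)))
```

Under the abstract's information-theoretic framing, higher coverage of the sequence alphabet plausibly translates into higher mutual information between the sampled set and fitness labels, which is where the small-data advantage would come from.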
Submission Number: 7