NanoFold: Designing Reproducible Protein Structure Benchmarks through Principled Sampling

Published: 28 May 2026, Last Modified: 28 May 2026, GenBio 2026 Poster, CC BY 4.0
Keywords: Protein structure prediction, AlphaFold, Benchmarking, Reproducibility, Dataset curation
TL;DR: NanoFold is a compact benchmark with statistically principled, leakage-controlled dataset curation across three sealed-evaluation tracks, making AlphaFold-style training studies reproducible at compute-accessible scale.
Abstract: Open-source AlphaFold (AF)-style systems have rapidly advanced protein structure prediction, but isolating the architectural and training choices that drive the gains remains difficult: production-scale training is computationally prohibitive outside well-resourced labs, and public corpora carry structural biases that compound under tight compute. As AI agents increasingly automate the ML research loop, accessible, tractable, automatically scorable, and distributionally representative benchmarks are needed. We introduce $\textbf{NanoFold}$, a compact fixed-data benchmark for AF-style training studies, paired with a codebase built for controlled head-to-head comparison. NanoFold defines three tracks with held-out test labels: a $\textit{limited}$ track for sample efficiency, a $\textit{research large}$ track for whether early gains persist under further optimization, and an $\textit{unlimited}$ track for best achievable performance under the fixed budget. Splits are disjoint by MMseqs2 sequence cluster and PDB entry and stratified on structural metadata, yielding $10{,}000$ training, $1{,}000$ public-validation, and $1{,}000$ sealed test chains. We verify the construction via structural features, sequence-family coverage, and protein foundation model embeddings, plus a randomization study over $1{,}000$ alternative splits, finding NanoFold statistically typical and well-distributed. Using OpenAI's GPT-5.5 in the Codex harness to autonomously run experiments across scales and regimes, we show the benchmark is learnable but unsaturated, scales predictably with budget, and separates training primitives, enabling transparent, reproducible architectural research at compute-accessible scale.
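To make the leakage-control idea concrete, here is a minimal sketch of a split that is disjoint by both sequence cluster and PDB entry. It is not the NanoFold pipeline: the function name `group_disjoint_split`, the `(chain_id, cluster_id, pdb_entry)` tuple layout, and the greedy fill strategy are all assumptions made for illustration, and the metadata stratification described in the abstract is omitted. Cluster IDs are assumed to have been computed beforehand (for example with MMseqs2). The sketch merges chains that share a cluster or an entry into one group via union-find, then assigns whole groups to splits.

```python
import random
from collections import defaultdict


def group_disjoint_split(chains, sizes, seed=0):
    """Assign chains to splits so that no cluster or PDB entry spans two splits.

    chains: list of (chain_id, cluster_id, pdb_entry) tuples (hypothetical layout).
    sizes:  dict mapping split name -> maximum number of chains in that split.
    Groups that fit in no split are left unassigned.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving; unseen nodes become their own root.
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # A chain links its cluster node to its entry node, so anything reachable
    # through shared clusters or shared entries ends up in one component.
    for _, cluster_id, pdb_entry in chains:
        union(("cluster", cluster_id), ("entry", pdb_entry))

    groups = defaultdict(list)
    for chain_id, cluster_id, _ in chains:
        groups[find(("cluster", cluster_id))].append(chain_id)

    # Sort before shuffling so the result depends only on the seed.
    group_list = sorted(groups.values())
    random.Random(seed).shuffle(group_list)

    splits = {name: [] for name in sizes}
    for members in group_list:
        # Greedy fill: place the whole group in the first split with room.
        for name, target in sizes.items():
            if len(splits[name]) + len(members) <= target:
                splits[name].extend(members)
                break
    return splits
```

Calling `group_disjoint_split(chains, {"train": 10000, "val": 1000, "test": 1000})` would mirror the split sizes quoted above, though a greedy fill may undershoot a target when the remaining groups are too large to fit. Rerunning with different `seed` values gives one simple way to generate alternative splits for a randomization study of the kind the abstract mentions.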
Submission Number: 157