Data-optimal scaling of paired antibody language models

Published: 24 Sept 2025 · Last Modified: 26 Dec 2025 · NeurIPS 2025 AI4Science Poster · CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: immunology, antibody language models, deep learning, scaling laws, transformers
TL;DR: Antibody language models (AbLMs) are constrained by data, which limits the quality of learned representations; we establish an AbLM-specific scaling law to determine the optimal number of paired sequences needed to train models of varying sizes.
Abstract: Scaling laws for large language models in natural language domains are typically derived under the assumption that performance is primarily compute-constrained. In contrast, antibody language models (AbLMs) trained on paired sequences are primarily data-limited, thus requiring different considerations. To explore how model size and data scale affect AbLM performance, we trained 15 AbLMs across all pairwise combinations of five model sizes and three training data sizes. From these experiments, we derive an AbLM-specific scaling law and estimate that training a data-optimal AbLM equivalent of the highly performant 650M-parameter ESM-2 protein language model would require $\sim 5.5$ million paired antibody sequences. Evaluation on multiple downstream classification tasks revealed that significant performance gains emerged only with sufficiently large model size, suggesting that in data-limited domains, improved performance depends jointly on both model scale and data volume.
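To make the scaling-law claim concrete, below is a minimal sketch of how such a fit might be performed: a Chinchilla-style loss surface $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ is fit to a 5 × 3 grid of (model size, data size, loss) runs, and a data requirement is read off for a fixed target model size. The functional form, the placeholder model and data sizes, the synthetic losses, and the equal-marginal-return criterion used to pick the data-optimal point are all assumptions for illustration; they are not the paper's actual parameterization, measurements, or fitting procedure.

```python
# Illustrative sketch only (assumed form and synthetic data, not the paper's fit).
import numpy as np
from scipy.optimize import curve_fit

def loss_surface(X, E, A, alpha, B, beta):
    """Chinchilla-style parametric loss: L(N, D) = E + A/N^alpha + B/D^beta."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Hypothetical grid: 5 model sizes (parameters) x 3 paired-sequence counts.
model_sizes = np.array([8e6, 35e6, 150e6, 350e6, 650e6])
data_sizes = np.array([2e5, 6e5, 1.6e6])  # paired antibody sequences (placeholder)
N_grid, D_grid = np.meshgrid(model_sizes, data_sizes, indexing="ij")
N_flat, D_flat = N_grid.ravel(), D_grid.ravel()

# Placeholder validation losses for the 15 runs (replace with real measurements).
rng = np.random.default_rng(0)
true_params = (1.8, 120.0, 0.30, 40.0, 0.28)
losses = loss_surface((N_flat, D_flat), *true_params) + rng.normal(0, 0.01, N_flat.size)

# Fit the five scaling-law coefficients to the 15 (N, D, loss) points.
p0 = (2.0, 100.0, 0.3, 30.0, 0.3)
params, _ = curve_fit(loss_surface, (N_flat, D_flat), losses, p0=p0, maxfev=20000)
E, A, alpha, B, beta = params

# One common way to read off a data budget for a fixed model size: balance the
# marginal returns of parameters and data (alpha*A/N^alpha = beta*B/D^beta),
# which follows from minimizing L under a C ~ 6*N*D compute constraint.
N_target = 650e6  # an ESM-2 650M-parameter-equivalent AbLM
D_opt = (beta * B * N_target**alpha / (alpha * A)) ** (1 / beta)
print(f"Estimated data-optimal paired sequences for {N_target:.0e} params: {D_opt:.2e}")
```

With the placeholder losses above, the recovered exponents and the resulting estimate of `D_opt` are purely demonstrative; in practice the fit would use the actual validation losses from the 15 trained AbLMs.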
Submission Number: 210