Keywords: binding-affinity prediction, deep mutational scanning, high-throughput screening
TL;DR: We provide a unified framework for end-to-end in silico protein binding affinity maturation based on deep mutational scanning data.
Abstract: Computational prediction of protein binding affinity is a cornerstone of modern drug development, accelerating tasks from lead optimization to de novo protein design. However, progress is often hampered by a lack of experimental validation and by evaluation practices, such as Random Cross-Validation (RandomCV), that can substantially overestimate model generalization on real-world tasks. To address this, we introduce HAIPR, a unified framework that standardizes the entire modeling pipeline, from training and optimization to inference, and provides an initial selection of algorithms, robust evaluation protocols, and curated benchmark datasets. By extending the BindingGYM benchmark and implementing more realistic, biologically meaningful data splits, our framework reveals that model performance on these challenging tasks is substantially lower than RandomCV suggests. We systematically compare classical machine learning approaches, such as Support Vector Regression (SVR) on protein language model (pLM) embeddings, with parameter-efficient fine-tuning (PEFT) of pLMs. Our results show that SVR can be competitive in low-data regimes and is less prone to model collapse, while PEFT methods offer clear advantages as dataset size and problem complexity increase. Furthermore, we analyze the minimum data requirements for reliable prediction and demonstrate that even modestly sized models can achieve performance that rivals the experimental reproducibility between state-of-the-art affinity assays, highlighting a critical ceiling for in silico prediction. Code and pre-computed embeddings are made available.
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 20560