Structural plausibility without binding specificity: limits of AI-based antibody-antigen structure prediction scores

Published: 02 Mar 2026, Last Modified: 05 Mar 2026
Venue: GEM 2026
License: CC BY 4.0
Keywords: structure prediction, antibody-antigen interactions, nanobodies, binding specificity, AlphaFold3, Boltz-2, Chai-1, protein–protein docking, antibody discovery
TL;DR: AI antibody-antigen predictors generate plausible complexes, but confidence scores fail to separate true binders from mismatches, making them unreliable as standalone filters for real-world drug discovery pipelines
Abstract: Antibodies are central to modern immunotherapy, yet accurately predicting antibody-antigen binding interfaces remains a major challenge for computational modeling. While recent AI-based structure prediction methods can generate plausible antibody-antigen complexes, it remains unclear whether they can reliably discriminate cognate binding partners and identify correct paratope-epitope interfaces in realistic discovery settings. Here, we introduce a controlled benchmarking framework to systematically evaluate publicly available state-of-the-art structure prediction models (AlphaFold3, Boltz-2, and Chai-1) on their ability to distinguish real antibody-antigen complexes (cognate pairs extracted from experimentally solved structures, n=106) from shuffled complexes, which serve as artificial non-cognate negative controls (n=11,342). We assessed structural accuracy, interface correctness, and how well commonly used confidence metrics (including ipTM and DockQ) separate cognate from non-cognate complexes when applied without access to structural ground truth, reflecting real-world deployment scenarios. We further analyzed sequence-level, structural, and sequence-structure features associated with high or low prediction confidence, independent of pairing correctness, and evaluated the trade-offs between computational cost and performance gains from increased sampling. To support community-driven mining, benchmarking, and method development, we release a large-scale dataset of ~1.8 million complex structures (561,800 VHH-antigen pairings predicted by 3 different tools). Our results show that current confidence scores (ipTM) often fail to discriminate cognate from non-cognate interactions, yielding a high false positive rate even with extensive sampling, and highlight key limitations in current antibody-antigen modeling pipelines.
This work provides a biologically grounded benchmark for antibody-antigen interface prediction and outlines critical directions for improving computational screening strategies in antibody discovery.
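The cognate-versus-shuffled comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' protocol: the all-by-all shuffling scheme, the use of AUROC as the summary statistic, the identifiers, the toy ipTM values, and the 0.75 cutoff are all assumptions chosen only to show how a confidence score might be tested as a standalone filter.

```python
from itertools import product


def make_pairings(antibody_ids, antigen_ids, cognate):
    """Label every antibody x antigen pairing: True if cognate, False if shuffled."""
    return {(ab, ag): (ab, ag) in cognate
            for ab, ag in product(antibody_ids, antigen_ids)}


def auroc(scores, labels):
    """Probability that a random cognate pair outscores a random shuffled pair.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one cognate and one shuffled pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def false_positive_rate(scores, labels, threshold):
    """Fraction of shuffled pairs whose score passes the cutoff."""
    neg = [s for s, y in zip(scores, labels) if not y]
    return sum(s >= threshold for s in neg) / len(neg)


# Hypothetical identifiers and invented ipTM values, for illustration only.
antibodies = ["vhh_a", "vhh_b", "vhh_c"]
antigens = ["ag_a", "ag_b", "ag_c"]
cognate = {("vhh_a", "ag_a"), ("vhh_b", "ag_b"), ("vhh_c", "ag_c")}
toy_iptm = {
    ("vhh_a", "ag_a"): 0.82, ("vhh_a", "ag_b"): 0.79, ("vhh_a", "ag_c"): 0.40,
    ("vhh_b", "ag_a"): 0.55, ("vhh_b", "ag_b"): 0.60, ("vhh_b", "ag_c"): 0.75,
    ("vhh_c", "ag_a"): 0.30, ("vhh_c", "ag_b"): 0.81, ("vhh_c", "ag_c"): 0.85,
}

pairings = make_pairings(antibodies, antigens, cognate)
pairs = sorted(pairings)
scores = [toy_iptm[p] for p in pairs]
labels = [pairings[p] for p in pairs]

toy_auroc = auroc(scores, labels)
toy_fpr = false_positive_rate(scores, labels, threshold=0.75)  # arbitrary cutoff
print(f"AUROC={toy_auroc:.3f}  FPR@0.75={toy_fpr:.2f}")
```

In this toy setup, half of the shuffled pairs still pass the cutoff even though the ranking looks reasonable overall, which is the kind of failure mode a high false positive rate refers to.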
Presenter: ~Eva_Smorodina1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 42