Bento: Benchmarking Classical and AI Docking on Drug Design–Relevant Data

ICLR 2026 Conference Submission18619 Authors

19 Sept 2025 (modified: 02 Dec 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: molecular docking, co-folding, protein-ligand interaction, drug design

TL;DR: AI-based protein-ligand interaction prediction methods perform comparably to physics-aware docking on drug-design-relevant data but fail to generalize to unseen protein pockets, with Gnina emerging as the most reliable performer.

Abstract: Recent advances in artificial intelligence have introduced deep learning and co-folding approaches for predicting protein-ligand complexes, raising the question of where they are applicable and how they compare with classical docking methods. We introduce Bento, a comprehensive benchmark that evaluates eleven tools for protein-ligand interaction prediction, both established and recently developed, spanning classical molecular docking methods, deep learning-based models, and co-folding algorithms. The tools are evaluated across four test datasets and multiple derived subsets in a pocket-aware setup. While most related benchmarking efforts primarily assess generalization capacity, we extend the analysis to also evaluate performance on drug-design-relevant data and across different classes of protein-ligand complexes. We show that 1) careful dataset curation is essential: filtering by pocket structural similarity and controlling ligand complexity exposes generalization failures that are obscured in conventional benchmarks; 2) classical and deep learning-based docking tools perform similarly well on drug-like ligands, making them comparably useful for virtual screening, with physics-based methods offering a clear advantage in speed; 3) co-folding tools outperform other approaches on structurally complex ligands, whereas most methods achieve similar accuracy on regular small molecules; and 4) all methods struggle to generalize to unseen pockets, with deep learning models being the most prone to overfitting. Overall, our results show that while current docking and deep learning-based approaches are reliable for many drug-design-relevant scenarios, genuine pocket-level generalization remains an open challenge. Bento provides a rigorous and transparent framework for diagnosing these limitations and guiding the development of more robust protein-ligand prediction models.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 18619
