PLINDER: The protein-ligand interactions dataset and evaluation resource

Published: 17 Jun 2024, Last Modified: 16 Jul 2024
Venue: ML4LMS Poster
License: CC BY 4.0
Keywords: Protein-Ligand, Dataset, Interactions, Machine Learning, Docking
Abstract: Protein-ligand interactions (PLI) are foundational to small molecule drug design. With computational methods striving towards experimental accuracy, there is a critical demand for a well-curated and diverse PLI dataset. Existing datasets are often limited in size and diversity, and commonly used evaluation sets suffer from training information leakage, hindering the realistic assessment of method generalization capabilities. To address these shortcomings, we present PLINDER, the largest and most annotated dataset to date, comprising 449,383 PLI systems, each with over 500 annotations, similarity metrics at protein, pocket, interaction and ligand levels, and paired unbound (apo) and predicted structures. We propose an approach to generate training and evaluation splits that minimizes task-specific leakage and maximizes test set quality, and compare the resulting performance of DiffDock when re-trained with different kinds of splits.
Poster: pdf
Submission Number: 106