DockedAC: A Dataset with Comprehensive 3D Protein-ligand Complexes for Activity Cliff Analysis

12 May 2025 (modified: 30 Oct 2025) | Submitted to NeurIPS 2025 Datasets and Benchmarks Track | License: CC BY-SA 4.0
Keywords: Deep learning dataset, Molecular property prediction, AI-aided drug discovery
Abstract: Artificial intelligence has become a crucial tool in drug discovery, excelling in tasks such as molecular property prediction. However, activity cliffs, where a minor structural modification to a molecule leads to a significant change in its biological activity, pose a challenge for predictive modeling. An activity cliff depends on the interaction between the target and the ligand, which previous ligand-centric studies have largely overlooked. Moreover, the limited availability of activity cliff data for target-ligand 3D complexes constrains the predictive power of modern deep learning models. In this paper, we introduce DockedAC, a new dataset that incorporates protein target and 3D complex structure information for studying activity cliffs. After matching protein binding information with ligand bioactivity data, we employ molecular docking to generate a complex structure for each activity value. DockedAC contains 82,836 activity data points across 52 protein targets, annotated with activity cliff information. The dataset represents a significant step toward large-scale activity cliff research using 3D complex structures. We benchmark the dataset with traditional machine learning and deep learning approaches. Our data and benchmark platform are available online.
Croissant File: json
Dataset URL: https://zenodo.org/records/14351603
Code URL: https://anonymous.4open.science/r/DockedAC
Primary Area: AI/ML Datasets & Benchmarks for health sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 2360
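
Illustration: the abstract describes an activity cliff as a pair of structurally similar molecules with a large difference in biological activity. The sketch below shows one common way such pairs could be flagged for ligands of a single target. It is not the DockedAC annotation procedure: the Tanimoto similarity over fingerprint bit sets, the 0.9 similarity threshold, the 10-fold activity threshold, the function names, and the toy ligands are all assumptions made for illustration.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_activity_cliffs(ligands, sim_threshold=0.9, fold_threshold=10.0):
    """Return ligand id pairs that are similar yet differ strongly in activity.

    ligands maps an id to (fingerprint bit set, activity in nM, must be > 0).
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    cliffs = []
    for (id_a, (fp_a, act_a)), (id_b, (fp_b, act_b)) in combinations(
        ligands.items(), 2
    ):
        similarity = tanimoto(fp_a, fp_b)
        fold_change = max(act_a, act_b) / min(act_a, act_b)
        if similarity >= sim_threshold and fold_change >= fold_threshold:
            cliffs.append((id_a, id_b))
    return cliffs


# Hypothetical ligands for one target: made-up bit sets and activities in nM.
toy_ligands = {
    "lig1": (frozenset(range(0, 20)), 5.0),
    "lig2": (frozenset(range(1, 20)), 800.0),
    "lig3": (frozenset(range(0, 20)), 8.0),
    "lig4": (frozenset(range(50, 70)), 9000.0),
}

print(find_activity_cliffs(toy_ligands))
# prints [('lig1', 'lig2'), ('lig2', 'lig3')]
```

In this toy set, lig1 and lig3 are identical in structure but nearly equipotent, so they are not flagged, while lig2 differs from each of them by one bit and by at least 100-fold in activity. A ligand-only check like this ignores the target; the dataset's stated motivation is that the docked 3D complex supplies the interaction context such a check lacks.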