Beyond Static Snapshots: A Large-Scale Dataset for Dynamics-Aware Protein-Nucleic Acid Modeling

Published: 30 May 2026, Last Modified: 01 Jun 2026
Venue: ICML2026-AI4Science Poster
License: CC BY 4.0
Additional Submission Instructions: For the camera-ready version, please include the author names and affiliations, funding disclosures, and acknowledgements.
Track: Track 2: Dataset Proposal Competition
Keywords: molecular dynamics, protein-nucleic acid, dataset, structural biology, machine learning, binding affinity, binding site prediction
Abstract: Protein-nucleic acid (NA) interactions govern essential biological processes, from transcription and translation to viral replication. Yet AI models for downstream protein-NA tasks are limited by the scarcity of high-quality dynamic data, as nearly all current datasets provide only static structures. To overcome this limitation, we propose the first large-scale, open repository of molecular dynamics (MD) simulations for protein-NA complexes. The dataset curates $\sim$1,000 high-resolution protein-RNA and protein-DNA complexes from the PDB and generates 3 $\mu$s of dynamics per system ($3 \times 1$ $\mu$s replicates), amounting to $\sim$3 ms of total simulation time. Each entry provides atomistic trajectories, post-processed features, and metadata, and all data will be integrated as a new MDDB node. This resource enables the development and systematic evaluation of dynamics-aware AI models, with binding site prediction as the primary target task. The focus is on fast-timescale conformational variability and the associated local fluctuations, side-chain plasticity, and induced-fit rearrangements that static structures cannot represent. Crucially, while MD-trained generative conformational samplers have recently emerged for proteins, protein-NA complexes have lacked the large-scale atomistic MD training data needed to extend such approaches. This dataset fills that gap, opening the door to all-atom generative conformational samplers, protein-NA complex sampling, and nucleic-acid-targeting protein design.
Submission Number: 98