ProteinConformers: Benchmark Dataset for Simulating Protein Conformational Landscape Diversity and Plausibility

Published: 18 Sept 2025, Last Modified: 30 Oct 2025
Venue: NeurIPS 2025 Datasets and Benchmarks Track poster
License: CC BY 4.0
Keywords: protein folding, conformational space, protein structure hallucination, protein conformation generation, dataset, benchmark
TL;DR: ProteinConformers is a novel large benchmark dataset enabling better evaluation of protein structure generators by capturing realistic conformational diversity and plausibility and proposing new assessment metrics.
Abstract: Understanding the conformational landscape of proteins is essential for elucidating protein function and facilitating drug design. However, existing protein conformation benchmarks fail to capture the full energy landscape, limiting their ability to evaluate the diversity and physical plausibility of AI-generated structures. We introduce ProteinConformers, a large-scale benchmark dataset comprising over 381,000 physically realistic conformations for 87 CASP targets. These were derived from more than 40,000 structural decoys via extensive all-atom molecular dynamics simulations totaling over 6 million CPU hours. Using this dataset, we propose novel metrics to evaluate conformational diversity and plausibility, and systematically benchmark six protein conformation generative models. Our results highlight that leveraging large-scale protein sequence data can enhance a model’s ability to explore conformational space, potentially reducing reliance on MD-derived data. Additionally, we find that PDB and MD datasets influence model performance differently, and that current models perform well on inter-atomic distance prediction but struggle with inter-residue orientation generation. Overall, our dataset, evaluation metrics, and benchmarking results provide the first comprehensive foundation for assessing generative models in protein conformational modeling. The dataset and instructions are available at https://huggingface.co/datasets/Jim990908/ProteinConformers/tree/main. Code is available at https://github.com/auroua/ProteinConformers. An interactive website is available at https://zhanggroup.org/ProteinConformers.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/Jim990908/ProteinConformers/tree/main
Code URL: https://github.com/auroua/ProteinConformers
Supplementary Material: pdf
Primary Area: AI/ML Datasets & Benchmarks for life sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 1323
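
The dataset URL above points at a Hugging Face dataset repository, so one plausible way to fetch it is through the `huggingface_hub` package. The following is a minimal sketch under that assumption, not the authors' documented procedure: the helper names `repo_id_from_url` and `download_dataset` are hypothetical, and the layout of the files inside the repository is not described on this page, so the sketch only downloads a snapshot and returns its local path.

```python
from typing import Optional
from urllib.parse import urlparse

# Dataset URL as listed on this page.
DATASET_URL = "https://huggingface.co/datasets/Jim990908/ProteinConformers/tree/main"


def repo_id_from_url(url: str) -> str:
    """Extract '<owner>/<name>' from a Hugging Face dataset URL (hypothetical helper)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: datasets/<owner>/<name>[/tree/<revision>]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def download_dataset(url: str = DATASET_URL, local_dir: Optional[str] = None) -> str:
    """Download a snapshot of the dataset repository and return its local path.

    Assumes the `huggingface_hub` package is installed; it is imported lazily
    so that the URL helper above works without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id_from_url(url),
        repo_type="dataset",  # the URL lives under /datasets/, not a model repo
        local_dir=local_dir,  # None keeps the default Hugging Face cache location
    )
```

Calling `download_dataset()` would fetch the full repository, which may be large given the number of conformations reported in the abstract; the instructions in the repository itself should take precedence over this sketch.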