2DNMRGym: An Annotated Experimental Dataset for Atom-Level Molecular Representation Learning in 2D NMR via Surrogate Supervision
Keywords: Molecular Representation Learning, Graph Neural Networks (GNN), Graph Transformers, Self-Supervised Learning, Structure Elucidation, Drug Discovery, Data-Centric AI, Benchmarking & Reproducibility
TL;DR: 2DNMRGym: the first large-scale, open-source 2D NMR (HSQC) dataset with benchmarks, enabling node-level graph regression; a surrogate-supervision setup scales training and ensures unbiased evaluation.
Abstract: Two-dimensional (2D) Nuclear Magnetic Resonance (NMR) spectroscopy, particularly Heteronuclear Single Quantum Coherence (HSQC) spectroscopy, plays a critical role in elucidating molecular structures, interactions, and electronic properties. However, accurately interpreting 2D NMR data remains labor-intensive and error-prone, requiring highly trained domain experts, especially for complex molecules. Machine Learning (ML) holds significant potential in 2D NMR analysis by learning molecular representations and recognizing complex patterns from data, but progress has been limited by the lack of large-scale, high-quality annotated datasets. In this work, we introduce $\textbf{2DNMRGym}$, the first annotated experimental dataset designed for ML-based molecular representation learning in 2D NMR. It includes over 22,000 HSQC spectra, along with the corresponding molecular graphs and SMILES strings. Uniquely, 2DNMRGym adopts a surrogate supervision setup: models are trained using algorithm-generated annotations derived from a previously validated method and evaluated on a held-out set of human-annotated gold-standard labels. This enables rigorous assessment of a model’s ability to generalize from imperfect supervision to expert-level interpretation. We provide benchmark results using a series of 2D and 3D GNN and GNN transformer models, establishing a strong foundation for future work. 2DNMRGym supports scalable model training and introduces a chemically meaningful benchmark for evaluating atom-level molecular representations in NMR-guided structural tasks. Our data and code are open-source and available at: https://github.com/siriusxiao62/2DNMRGym.
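The surrogate supervision setup described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the `source` field, the helper names, and the numeric values are hypothetical placeholders (not measurements and not the dataset's actual schema), and the mean-shift baseline merely stands in for a real GNN. It only shows the protocol: fit on algorithm-annotated molecules, then score per-atom (13C, 1H) shift predictions against expert-annotated ones.

```python
from statistics import mean

# Hypothetical toy records. Each molecule carries per-atom (13C, 1H) shift
# pairs in ppm; the numbers are placeholders, not real measurements.
records = [
    {"id": "mol_a", "source": "algorithm", "shifts": [(10.0, 1.0), (30.0, 3.0)]},
    {"id": "mol_b", "source": "algorithm", "shifts": [(20.0, 2.0)]},
    {"id": "mol_c", "source": "expert", "shifts": [(22.0, 2.5), (18.0, 1.5)]},
]


def split_by_supervision(items):
    """Train on algorithm-generated labels, evaluate on expert labels."""
    train = [r for r in items if r["source"] == "algorithm"]
    test = [r for r in items if r["source"] == "expert"]
    return train, test


def fit_mean_baseline(train):
    """Stand-in for a model: predict the mean training shift for every atom."""
    pairs = [pair for r in train for pair in r["shifts"]]
    return mean(c for c, _ in pairs), mean(h for _, h in pairs)


def atom_level_mae(prediction, test):
    """Mean absolute error per nucleus, averaged over all annotated atoms."""
    pred_c, pred_h = prediction
    pairs = [pair for r in test for pair in r["shifts"]]
    mae_c = mean(abs(pred_c - c) for c, _ in pairs)
    mae_h = mean(abs(pred_h - h) for _, h in pairs)
    return mae_c, mae_h


train_set, test_set = split_by_supervision(records)
baseline = fit_mean_baseline(train_set)
mae_c, mae_h = atom_level_mae(baseline, test_set)
print(f"13C MAE: {mae_c:.2f} ppm, 1H MAE: {mae_h:.2f} ppm")
```

Keeping the expert-labeled molecules out of training entirely is what lets the reported error reflect agreement with human interpretation rather than with the annotation algorithm.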
Neurips Id: 1950
Neurips Title: 2DNMRGym: An Annotated Experimental Dataset for Atom-Level Molecular Representation Learning in 2D NMR via Surrogate Supervision
Neurips Authors: Yunrui Li, Hao Xu, Pengyu Hong
Neurips Pdf: pdf
Neurips Review Document: pdf
Changes Made: pdf
Confirmation: By submitting, authors affirm that all information provided is complete, truthful, and anonymized, including the NeurIPS-25 paper ID, author list, title, original PDF, reviews, meta-review, and author responses. The organizing committee reserves the right to verify the accuracy of all submitted materials after acceptance, in coordination with the NeurIPS-25 Organizing Committee. Providing false or misleading information constitutes a violation of scientific integrity. Submissions found to violate this policy will be desk-rejected.
Submission Number: 5