GlycoNMR: Dataset and Benchmark of Carbohydrate-Specific NMR Chemical Shift for Machine Learning Research

Published: 10 Jul 2024, Last Modified: 10 Jul 2024, Accepted by DMLR
Abstract: Molecular representation learning (MRL) is a powerful contribution by machine learning to chemistry as it converts molecules into numerical representations, which is fundamental for diverse biochemical applications, such as property prediction and drug design. While MRL has had great success with proteins and general biomolecules, it has yet to be explored for carbohydrates in the growing fields of glycoscience and glycomaterials (the study and design of carbohydrates). This under-exploration can be primarily attributed to the limited availability of comprehensive and well-curated carbohydrate-specific datasets and a lack of machine learning (ML) techniques tailored to meet the unique problems presented by carbohydrate data. Interpreting and annotating carbohydrate data is generally more complicated than protein data and requires substantial domain knowledge. In addition, existing MRL methods were predominantly optimized for proteins and small biomolecules and may not be effective for carbohydrate applications without special modifications. To address these challenges, accelerate progress in glycoscience and glycomaterials, and enrich the data resources of the ML community, we introduce GlycoNMR. GlycoNMR contains two laboriously curated datasets with 2,609 carbohydrate structures and 211,543 annotated nuclear magnetic resonance (NMR) atomic-level chemical shifts that can be used to train ML models for precise atomic-level prediction. NMR data is one of the most appealing starting points for developing ML techniques to facilitate glycoscience and glycomaterials research, as NMR is the preeminent technique in carbohydrate structure research, and biomolecule structure is among the foremost predictors of functions and properties. We tailored a set of carbohydrate-specific features and adapted existing 3D-based graph neural networks to tackle the problem of predicting NMR shifts effectively.
For illustration, we benchmark these modified MRL models on GlycoNMR.
Certifications: Dataset Certification, Reproducibility Certification
Keywords: AI for science, Glycoscience, Graph Neural Network, Nuclear Magnetic Resonance
Changes Since Last Submission: N/A
Changes Since Previous Publication: N/A
Code: https://github.com/Cyrus9721/GlycoNMR
Assigned Action Editor: ~Yue_Zhao13
Submission Number: 42