Keywords: Benchmarks, Glycan, ML, property prediction, interaction prediction
TL;DR: We present a new, actively maintained benchmark suite for glycan property prediction and evaluate a diverse collection of ML models on its datasets.
Abstract: Glycan property prediction is an increasingly popular area of machine learning research. Supervised learning approaches have shown promise in glycan modeling; however, the current literature is fragmented with respect to datasets and evaluation protocols, hampering progress in understanding these complex, branched carbohydrates that play crucial roles in biological processes. To facilitate progress, we introduce GlycoGym, a comprehensive benchmark suite containing six biologically relevant supervised learning tasks spanning different domains of glycobiology: glycosylation linkage identification, tissue expression prediction, taxonomy classification, tandem mass spectrometry fragmentation prediction, lectin-glycan interaction modeling, and structural property estimation. We curate each task into fixed training, validation, and test splits using multi-class stratification, so that each task tests biologically relevant generalization that transfers to real-life glycan property prediction scenarios. We benchmark a diverse range of approaches to glycan representation learning, spanning fingerprint-based baselines, language models operating on IUPAC-condensed sequences, and graph neural networks explicitly designed for glycan topology, including Sweet-Net, GLAMOUR, and the recent GIFFLAR architecture. We find that specialized glycan encoders consistently outperform simple baselines on the more complex tasks. GlycoGym will help the machine learning community focus its efforts on scientifically relevant glycan prediction problems and will be regularly updated through integration with the glycowork Python package. To this end, all data and code used to run these experiments will be made available on GitHub and Zenodo.
Primary Area: datasets and benchmarks
Submission Number: 20461