Keywords: Protein, biology, representation learning, benchmark, multiscale protein models, structure representation learning, substructures, motifs, domain, protein function, protein structure, protein representation learning
TL;DR: Prevailing protein models ignore the fact that proteins are composed of recurrent, modular substructures that are functional and evolutionarily conserved. We think that should change.
Abstract: Protein representation learning has achieved major advances using large sequence and structure datasets, yet current models primarily operate at the level of individual residues or entire proteins. This overlooks a critical aspect of protein biology: proteins are composed of recurrent, evolutionarily conserved substructures that mediate core molecular functions. Despite decades of curated biological knowledge, these substructures remain largely unexploited in modern protein models. We introduce Magneton, an integrated environment for developing substructure-aware protein models. Magneton provides (1) a large-scale dataset of 530,601 proteins annotated with over 1.7 million substructures spanning 13,075 types, (2) a training framework for incorporating substructures into existing models, and (3) a benchmark suite of 13 tasks probing residue-, substructure-, and protein-level representations. Using Magneton, we develop substructure-tuning, a supervised fine-tuning method that distills substructural knowledge into pretrained protein models. Across state-of-the-art sequence- and structure-based models, substructure-tuning improves function-related tasks while revealing that substructural signals are complementary to global structural information.
The Magneton environment, datasets, and substructure-tuned models are all openly available.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 22257