Keywords: Knowledge Graphs, Heterogeneous Graph Transformer, Foundation Models, Metabolic Engineering, Self-Supervised Learning
TL;DR: A Heterogeneous Graph Transformer outperforms tabular models at titer prediction across organisms, targets, and genetic modifications.
Abstract: Designing microbial strains that produce high-value chemicals at
commercially viable titers remains a central challenge in metabolic
engineering. Existing computational approaches either rely on
stoichiometric constraint-based models that cannot learn from
experimental data, or apply tabular machine learning to hand-crafted
features that discard the relational structure of biological
knowledge. We present Canopy, a heterogeneous graph foundation
model that integrates ten public and proprietary data sources into a
unified knowledge graph (KG) of 6.9M nodes spanning 13 node types and 34 edge types,
covering genes, proteins, metabolites, reactions, pathways, strains,
and fermentation experiments. Node features are encoded through
domain-specific foundation models (ESM-2 for protein sequences,
MoLFormer for chemical SMILES, and PubMedBERT for biomedical
text), yielding a multi-modal representation within a single graph.
We pretrain a Heterogeneous Graph Transformer (HGT) augmented with SignNet
positional encodings, Jumping Knowledge aggregation, and virtual
nodes using four self-supervised objectives (link
prediction, masked node modelling, distance prediction, and
contrastive experiment clustering), balanced via learned homoscedastic
uncertainty weighting. On the downstream task of fermentation titer
prediction, frozen Canopy embeddings achieve $R^{2} = 0.41$ with
a lightweight probe, outperforming tabular baselines (best
$R^{2} = 0.13$) and homogeneous GNN variants.
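
The abstract does not spell out how the four self-supervised objectives are balanced, so the following is only a minimal sketch of one common form of learned homoscedastic uncertainty weighting, in which each task loss $L_i$ is scaled by $\tfrac{1}{2}e^{-s_i}$ and regularized by $\tfrac{1}{2}s_i$, with $s_i = \log \sigma_i^2$ a learnable scalar. The function name, the task keys, and the numeric loss values are hypothetical and chosen for illustration; they are not taken from Canopy.

```python
import math


def uncertainty_weighted_loss(losses, log_vars):
    """Combine per-task losses with learned homoscedastic uncertainty weights.

    Assumed form (not confirmed by the source):
        total = sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i
    where s_i = log(sigma_i^2) would be a trainable parameter per task.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need exactly one log-variance per task loss")
    total = 0.0
    for loss, s in zip(losses, log_vars):
        # exp(-s) down-weights tasks with high learned uncertainty;
        # the + 0.5 * s term penalizes inflating the uncertainty without bound.
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total


# Hypothetical per-task losses for the four objectives named in the abstract.
task_losses = {
    "link_prediction": 0.9,
    "masked_node_modelling": 2.4,
    "distance_prediction": 1.1,
    "contrastive_experiment_clustering": 3.0,
}

# Starting every s_i at 0 gives each task the same weight of 0.5.
initial_log_vars = [0.0] * len(task_losses)
combined = uncertainty_weighted_loss(list(task_losses.values()), initial_log_vars)
print(f"combined loss at initialization: {combined:.3f}")
```

In an actual training loop the `log_vars` would be optimized jointly with the model parameters, so that noisier objectives are automatically down-weighted rather than tuned by hand.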
Submission Number: 53