Keywords: JEPA, spatial transcriptomics, self-supervised learning, foundation models
TL;DR: ST-JEPA is the first joint-embedding predictive architecture for spatial transcriptomics, converting cell neighborhoods into multi-scale transformer sequences that achieve state-of-the-art niche identification and batch integration.
Abstract: Spatial transcriptomics measures gene expression at single-cell resolution and at scale while recording the spatial location of each cell within the tissue. The resulting data are typically treated as a tabular matrix (gene expression counts paired with spatial coordinates), which does not map naturally onto the sequence inputs that transformers expect. Existing methods are largely task-specific, while recent foundation models either omit spatial context during pretraining (Nicheformer) or rely on contrastive objectives (Novae). We introduce ST-JEPA, the first joint-embedding predictive architecture for spatial transcriptomics. ST-JEPA converts cell-level spatial data into structured transformer sequences via multi-scale graph tokenization at three biological resolutions (cellular neighborhood, cell, and gene), producing hierarchical embeddings for diverse downstream tasks. Trained on mouse brain data spanning two technologies with non-overlapping gene panels, ST-JEPA achieves the best niche identification (weighted NMI = 0.67) and, among methods that perform well on niche identification, the best batch integration (iLISI), without explicit integration objectives. Systematic ablations across six design axes provide practical guidance for self-supervised learning on spatial transcriptomics data.
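To make the idea of turning a tabular cell matrix into a nested sequence concrete, the following is a minimal sketch and not ST-JEPA's actual tokenization, which the abstract does not specify. It assumes a k-nearest-neighbor graph over cell coordinates defines the neighborhood and that each cell is represented by its most highly expressed genes; the function names `knn_indices`, `top_gene_tokens`, and `build_sequence`, and the parameters `k` and `n_top`, are hypothetical.

```python
import numpy as np

def knn_indices(coords, k):
    """Indices of the k nearest cells to each cell, excluding the cell itself."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise squared Euclidean distances, shape (n_cells, n_cells).
    diff = coords[:, None, :] - coords[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist2, np.inf)  # a cell is not its own neighbor
    return np.argsort(dist2, axis=1, kind="stable")[:, :k]

def top_gene_tokens(expr, n_top):
    """Per cell, the indices of its n_top most highly expressed genes."""
    expr = np.asarray(expr, dtype=float)
    return np.argsort(-expr, axis=1, kind="stable")[:, :n_top]

def build_sequence(center, coords, expr, k=3, n_top=2):
    """Assumed three-level layout: one neighborhood -> its cells -> gene tokens.

    Returns a list of (cell_id, [gene_ids]) pairs, with the center cell first
    and its k spatial neighbors after it, nearest first.
    """
    neighbors = knn_indices(coords, k)[center]
    genes = top_gene_tokens(expr, n_top)
    cells = [int(center), *neighbors.tolist()]
    return [(c, genes[c].tolist()) for c in cells]
```

Calling `build_sequence(0, coords, expr)` on a coordinate array of shape `(n_cells, 2)` and an expression matrix of shape `(n_cells, n_genes)` yields one neighborhood-level sequence. A real pipeline would still need to map these indices to learned embeddings and handle gene panels that differ across technologies.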
Submission Number: 16