ST-Align: Multi-Scale Image-Gene Foundation Modeling for Spatial Transcriptomics via Spot-Niche Alignment
Keywords: spatial transcriptomics, foundation models, multimodal learning, contrastive learning, image-gene alignment, multi-scale representation learning, digital pathology
TL;DR: A spot-niche multi-scale image-gene foundation model for spatial transcriptomics that improves zero-shot spatial domain identification and image-to-gene prediction.
Abstract: Spatial transcriptomics (ST) measures genome-wide gene expression together with tissue morphology at spatially indexed locations, enabling region-resolved molecular analysis that is not accessible to bulk sequencing or histology alone. Learning robust multimodal representations from ST is challenging because spot images are low resolution, spot-level gene vectors reflect mixed-cell composition, and biologically meaningful signal often depends on local neighborhoods rather than isolated spots.
We present ST-Align, a domain-adapted image–gene pretraining framework for ST that injects an explicit spot–niche inductive bias. ST-Align represents each spot together with a local neighborhood (niche) and aligns image and gene representations at three levels: spot-level image–gene alignment, niche-level alignment between neighborhood morphology and aggregated gene expression, and a cross-scale spot–niche objective that couples local and tissue-context signals.
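The three alignment levels described above can be illustrated with a small sketch. The abstract does not specify the loss or the aggregation, so everything here is an assumption made for illustration: a symmetric InfoNCE contrastive loss, mean pooling over a neighbor list to form niche embeddings, equal weights for the three terms, and the hypothetical names `info_nce` and `spot_niche_loss`. It is not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: a[i] and b[i] form a positive pair, all other rows are negatives."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]

    def cross_entropy_diag(rows):
        # Mean cross-entropy with the diagonal entry as the target class.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

def spot_niche_loss(spot_img, spot_gene, neighbors, weights=(1.0, 1.0, 1.0)):
    """Sum of spot-level, niche-level, and cross-scale alignment terms.

    neighbors[i] lists the spot indices in spot i's niche (assumed to include i).
    Mean pooling and equal weights are illustrative assumptions.
    """
    niche_img = [mean_pool([spot_img[j] for j in nb]) for nb in neighbors]
    niche_gene = [mean_pool([spot_gene[j] for j in nb]) for nb in neighbors]
    l_spot = info_nce(spot_img, spot_gene)      # spot image vs. spot gene
    l_niche = info_nce(niche_img, niche_gene)   # niche image vs. niche gene
    l_cross = 0.5 * (info_nce(spot_img, niche_gene) + info_nce(niche_img, spot_gene))
    w_s, w_n, w_c = weights
    return w_s * l_spot + w_n * l_niche + w_c * l_cross

# Toy usage: four spots with one-hot embeddings and ring-shaped niches.
eye = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
ring = [[0, 1], [1, 2], [2, 3], [3, 0]]
aligned = spot_niche_loss(eye, eye, ring)
shuffled = spot_niche_loss(eye, eye[1:] + eye[:1], ring)
```

In this toy setup the matched image and gene embeddings yield a lower total loss than the shuffled pairing, which is the behavior a contrastive alignment objective is meant to reward.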
We pretrain ST-Align on 1.3 million spot-level image–gene pairs from 573 curated human 10x Visium slides (STimage-1K4M) and evaluate (i) zero-shot transfer for spatial domain identification on six held-out human brain slices and (ii) image-to-gene prediction under patient-level splits. ST-Align improves spatial domain identification by 28.7% over the best multimodal baseline (ARI 0.340 vs. 0.256) and reduces gene prediction error by 16.5% (MSE 0.168 vs. 0.184), with particularly strong gains for non-laminar genes. Overall, these results support multi-scale spot–niche alignment as a useful design principle for ST representation learning in human 10x Visium data. Broader validation across tissues and ST technologies remains future work.
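For readers unfamiliar with the metric, the sketch below computes a clustering agreement score, assuming that ARI here denotes the standard adjusted Rand index between predicted spatial domains and reference annotations. The function name `adjusted_rand_index` and the toy labels are hypothetical; how the authors derive domain labels from embeddings is not stated in the abstract and is not modeled here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items.

    1.0 means identical partitions (up to label renaming); values near 0
    are what random assignment would give on average.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # degenerate case: no pairs to compare

    # Pairs of items placed together in both, in the truth, and in the prediction.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # both partitions are trivial (e.g. a single cluster each)
    return (sum_cells - expected) / (max_index - expected)

# Toy usage: label names differ but the partition is the same.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, it is invariant to how cluster labels are named, which suits zero-shot settings where predicted cluster identifiers have no fixed correspondence to annotated domains.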
Submission Number: 78