Submission Track: Short papers presenting ongoing research or work submitted to other venues (up to 5 pages, excluding references)
Keywords: spatial transcriptomics, representation learning, deep learning, multimodal fusion, biological vision, foundation models, histology, survey
TL;DR: A comprehensive survey of over 40 deep learning frameworks for spatial transcriptomics, organized by core tasks and highlighting the shift towards multimodal representation learning and foundation models.
Abstract: Spatial transcriptomics maps the transcriptome onto tissue sections, yielding multimodal data that pairs histology images with spatially resolved gene expression. Learning meaningful representations from these data is a fundamental challenge in biological vision. This paper presents a thorough survey of representation learning in spatial omics, critically comparing over 40 deep learning frameworks. We group these models by the core tasks their learned representations are designed to solve: cell type deconvolution, spatial domain identification, gene expression imputation, 3D tissue reconstruction, and cell-cell interaction simulation. Special attention is given to the dominant architectures for representation learning in this domain, including graph neural networks, contrastive learning, and multimodal fusion methods. We evaluate representative models such as ADCL, CellMirror, and MuST with respect to the scalability, interpretability, and biological impact of their learned embeddings. The survey also addresses common challenges that hinder representation learning, including spatial noise, modality imbalance, and low-resolution data. Finally, we outline future directions centered on building foundation models for spatial biology and improving 3D alignment. This review provides a critical guide for researchers developing foundational and task-specific representations from multimodal spatial data.
Submission Number: 16