From Spatial Transcriptomics to Tokens: Generative Pre-Training with Byte-Pair Encoding

ICLR 2026 Conference Submission 14028 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Representation Learning, Spatial Transcriptomics, Tokenization, Byte-Pair Encoding
TL;DR: STBPE converts spatial transcriptomics into token representations, enhancing representation learning for cell identity and disease mechanisms.
Abstract: Generative pre-trained models have achieved remarkable success in natural language processing and computer vision, while spatial single-cell transcriptomics has emerged as a powerful tool for investigating disease mechanisms. Current methods largely overlook the impact of RNA spatial organization on cellular identity and disease processes. This can lead to the loss of RNA co-localization information, incomplete spatial transcriptome analysis, and insufficient investigation of disease mechanisms, and thus to missed strategies for clinical diagnosis. To address these issues, we propose STBPE (Spatial Transcriptomics Byte Pair Encoding), a pre-training framework that focuses on subcellular resolution. The framework integrates a spatially aware byte pair encoding strategy that converts the subcellular localization of RNA within a single cell into serialized tokens, yielding a precise digital representation of RNA spatial distribution patterns. Specifically, it first uses a tokenization algorithm driven by spatial omics data to encode the spatial coordinates and transcript features of RNA into a unified byte-pair sequence. It then adopts a BERT-style masked self-supervised learning paradigm: a subset of the spatially aware tokens is randomly masked and the model reconstructs the original sequence, which forces it to learn deep embeddings that carry spatial position information. This design enables STBPE to capture potential associations between RNA spatial distribution and gene expression, significantly improve cell type annotation, uncover co-localized RNAs associated with cellular identity from a new perspective, and pave the way for multimodal foundation models that integrate spatial transcriptomics with natural language.
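The two stages named in the abstract (spatially aware byte pair encoding, then BERT-style masking) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how coordinates are serialized, so the sketch assumes that each cell's transcripts are ordered by a coarse spatial grid bin so that neighbours in the sequence are spatially close. The function names (`serialize_cell`, `learn_bpe`, `mask_tokens`), the `bin_size` parameter, the `+` merge separator, and the 15% masking rate are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def serialize_cell(transcripts, bin_size=1.0):
    """Turn one cell's transcripts, given as (gene, x, y), into a gene sequence.

    Assumption: transcripts are ordered row-major by grid bin, so adjacent
    tokens tend to be spatially close. The paper may serialize differently.
    """
    def key(t):
        gene, x, y = t
        return (int(y // bin_size), int(x // bin_size), gene)
    return [gene for gene, _, _ in sorted(transcripts, key=key)]


def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Standard BPE training loop: repeatedly merge the most frequent adjacent pair."""
    sequences = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in sequences:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        # Highest count wins; ties are broken lexicographically for determinism.
        pair = min(counts, key=lambda p: (-counts[p], p))
        merges.append(pair)
        new_token = pair[0] + "+" + pair[1]
        sequences = [merge_pair(s, pair, new_token) for s in sequences]
    return merges, sequences


def mask_tokens(seq, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Simplified BERT-style masking: hide tokens at random, keep targets as labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in seq:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)    # the model would be trained to predict this
        else:
            masked.append(tok)
            labels.append(None)   # position not scored
    return masked, labels


# Toy data: two cells with made-up coordinates (illustrative only).
cells = [
    [("ACTB", 0.2, 0.1), ("GAPDH", 0.4, 0.3), ("MALAT1", 5.1, 5.2)],
    [("GAPDH", 0.6, 0.2), ("ACTB", 0.1, 0.5), ("XIST", 3.0, 3.0)],
]
merges, encoded = learn_bpe([serialize_cell(c) for c in cells], num_merges=1)
masked, labels = mask_tokens(encoded[0])
```

In this toy input, ACTB and GAPDH fall in the same grid bin in both cells, so their pair is the most frequent and is merged into a single token. That is the sense in which a BPE vocabulary over spatially ordered sequences could surface co-localized RNAs. A full BERT-style setup would also replace some selected tokens with random or unchanged tokens, which the sketch omits.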
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 14028