Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Track: tiny / short paper (2-4 pages excluding references; extended abstract format)
Keywords: multimodal learning, vision transformer, foundation models, histopathology, Xenium, spatial transcriptomics, CLIP
TL;DR: We benchmark unimodal and multimodal models for predicting Xenium gene expression from paired H&E, introduce a novel token-level early-fusion model, and observe consistent multimodal gains driven by spatially aligned transcript tokens.
Abstract: Understanding how molecular programs are embedded within tissue morphology is a central challenge in spatial biology. While vision transformer (ViT) foundation models capture rich histological structure and spatial transcriptomics (ST) provides molecular context, existing multimodal approaches largely rely on contrastive alignment and do not directly learn joint morpho-molecular representations. We introduce an early-fusion multimodal transformer that integrates subcellular Xenium transcript readouts directly into the ViT token stream, enabling fine-grained cross-modal interaction without cell segmentation. We evaluate our approach on a gene prediction task: predicting held-out genes from a targeted Xenium panel given histology and a core gene set. Across a comprehensive benchmark of unimodal baselines and vanilla late-fusion variants, early fusion achieves substantial improvements in gene expression prediction. We further show that performance gains are driven primarily by spatially aligned, token-level transcript representations rather than by fusion timing alone. With appropriate transcript tokenization, late fusion can perform on par with early fusion, which explains the limitations observed in prior CLIP-style models. Our results highlight expressive, spatially grounded fusion as a key ingredient for multimodal representation learning in spatial biology.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 89