Keywords: computational pathology, vision transformers, spatial transcriptomics, multimodal learning, token fusion
TL;DR: We propose a token-fusion transformer that integrates H&E histology and spatial transcriptomics, achieving superior performance on disease-state prediction compared to unimodal baselines.
Abstract: Tissues can be characterized by their complex morphological structures and molecular programs, as captured by histology images and spatial transcriptomic technologies. Current unimodal foundation models are limited in their ability to reason across morphological and molecular features. We introduce a multimodal transformer architecture that unifies histology images and spatial transcriptomics through token-level fusion. By representing both modalities as interoperable tokens within a shared sequence, our model integrates morphological and molecular features throughout all layers, prioritizing cross-modal relationships over isolated single-modality representations. The resulting token-fusion transformer captures rich morphological and molecular signatures, contextualizing histopathology patterns with molecular information and vice versa. Though preliminary, our results demonstrate that token fusion enhances disease-state prediction, laying the groundwork for multimodal models capable of reasoning jointly over tissue morphology and gene expression.
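The sketch below illustrates one way token-level fusion of this kind could be realized: histology tile embeddings and spatial-transcriptomics spot profiles are projected into a shared token space, tagged with modality-type embeddings, and concatenated into a single sequence so self-attention mixes the two modalities at every layer. It is not the authors' implementation; all module names, dimensions, and the choice of `nn.TransformerEncoder` are illustrative assumptions.

```python
# Minimal sketch of token-level fusion (assumed design, not the paper's code).
import torch
import torch.nn as nn

class TokenFusionTransformer(nn.Module):
    def __init__(self, img_feat_dim=768, gene_dim=2000, d_model=512,
                 n_layers=6, n_heads=8, n_classes=2):
        super().__init__()
        # Project each modality into the shared token space.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.gene_proj = nn.Linear(gene_dim, d_model)
        # Learned modality-type embeddings (0 = histology, 1 = transcriptomics).
        self.modality_emb = nn.Embedding(2, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, img_tokens, gene_tokens):
        # img_tokens:  (B, N_img, img_feat_dim)  e.g. pre-extracted H&E tile embeddings
        # gene_tokens: (B, N_spot, gene_dim)     e.g. normalized expression per spot
        img = self.img_proj(img_tokens) + self.modality_emb.weight[0]
        gene = self.gene_proj(gene_tokens) + self.modality_emb.weight[1]
        cls = self.cls_token.expand(img.size(0), -1, -1)
        # Both modalities share one sequence, so every attention layer can relate
        # morphology tokens to expression tokens (spatial encodings omitted for brevity).
        tokens = torch.cat([cls, img, gene], dim=1)
        fused = self.encoder(tokens)
        return self.head(fused[:, 0])  # disease-state logits from the [CLS] token

# Usage with random tensors standing in for real data.
model = TokenFusionTransformer()
logits = model(torch.randn(2, 196, 768), torch.randn(2, 64, 2000))
print(logits.shape)  # torch.Size([2, 2])
```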
Submission Number: 66