Keywords: computational pathology, vision transformers, spatial transcriptomics, multimodal learning, token fusion
TL;DR: We propose a token-fusion transformer that integrates H&E histology and spatial transcriptomics, achieving superior performance on disease-state prediction compared to unimodal baselines.
Abstract: Tissues can be characterized by their complex morphological structures and molecular programs, as captured by histology images and spatial transcriptomic technologies. Current unimodal foundation models are limited in their ability to reason across morphological and molecular features. We introduce a multimodal transformer architecture that unifies histology images and spatial transcriptomics through token-level fusion. By representing both modalities as interoperable tokens within a shared sequence, our model integrates morphological and molecular features throughout all layers, prioritizing cross-modal relationships over isolated single-modality representations. The resulting token-fusion transformer captures rich morphological and molecular signatures, contextualizing histopathology patterns with molecular information and vice versa. Though preliminary, our results demonstrate that token fusion enhances disease-state prediction, laying the groundwork for multimodal models capable of reasoning jointly over tissue morphology and gene expression.
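The sketch below illustrates one way token-level fusion of this kind could be realized: histology tile embeddings and spatial-transcriptomics spot profiles are projected into a shared token space, tagged with modality-type embeddings, and concatenated into a single sequence so self-attention mixes the two modalities at every layer. It is not the authors' implementation; all module names, dimensions, and the choice of `nn.TransformerEncoder` are illustrative assumptions.

```python
# Minimal sketch of token-level fusion (assumed design, not the paper's code).
import torch
import torch.nn as nn

class TokenFusionTransformer(nn.Module):
    def __init__(self, img_feat_dim=768, gene_dim=2000, d_model=512,
                 n_layers=6, n_heads=8, n_classes=2):
        super().__init__()
        # Project each modality into the shared token space.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.gene_proj = nn.Linear(gene_dim, d_model)
        # Learned modality-type embeddings (0 = histology, 1 = transcriptomics).
        self.modality_emb = nn.Embedding(2, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, img_tokens, gene_tokens):
        # img_tokens:  (B, N_img, img_feat_dim)  e.g. pre-extracted H&E tile embeddings
        # gene_tokens: (B, N_spot, gene_dim)     e.g. normalized expression per spot
        img = self.img_proj(img_tokens) + self.modality_emb.weight[0]
        gene = self.gene_proj(gene_tokens) + self.modality_emb.weight[1]
        cls = self.cls_token.expand(img.size(0), -1, -1)
        # Both modalities share one sequence, so every attention layer can relate
        # morphology tokens to expression tokens (spatial encodings omitted for brevity).
        tokens = torch.cat([cls, img, gene], dim=1)
        fused = self.encoder(tokens)
        return self.head(fused[:, 0])  # disease-state logits from the [CLS] token

# Usage with random tensors standing in for real data.
model = TokenFusionTransformer()
logits = model(torch.randn(2, 196, 768), torch.randn(2, 64, 2000))
print(logits.shape)  # torch.Size([2, 2])
```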
Submission Number: 66