STAMP (Submitted to ICLR 2026)
================================

Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology

<img src="docs/logo.png" width="200px" align="right" />

**Abstract:** Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a **S**patial **T**ranscriptomics-**A**ugmented **M**ultimodal **P**athology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, and trained a spatially-aware gene encoder on this resource. Leveraging hierarchical multi-scale contrastive alignment and cross-scale patch localization mechanisms, STAMP effectively aligns spatial transcriptomics with pathology images, capturing spatial structure and molecular variation. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance. These results highlight the value and necessity of integrating spatially resolved molecular supervision for advancing multimodal learning in computational pathology. The code is included in the supplementary materials. The pretrained weights and SpaVis-6M will be released for community development after the manuscript has been reviewed.

---

<img src="docs/framework.png" width="100%" align="center" />

## SpaVis-6M: The Largest Spatial Transcriptomics Dataset Based on 10X Visium

SpaVis-6M encompasses 1982 slices derived from 35 distinct organs and 262 studies/datasets. In total, the collection comprises 5.75 million spatial transcriptomic gene expression profiles. **The metadata for SpaVis-6M is available in [SpaVis_6M.csv](./src/dataset_info/SpaVis_6M.csv).**
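To get a quick overview of the metadata file, a small helper along the following lines may be useful. It is only a sketch: `summarize_metadata` is a hypothetical name that is not part of this repository, and the code makes no assumption about which columns the CSV contains beyond the file having a header row.

```python
import csv


def summarize_metadata(csv_path):
    """Return (row_count, column_names) for a metadata CSV with a header row.

    Hypothetical helper for illustration; not part of the STAMP code base.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # fieldnames is None for an empty file, so fall back to an empty list.
        columns = list(reader.fieldnames or [])
        # Count the remaining data rows without loading them all into memory.
        row_count = sum(1 for _ in reader)
    return row_count, columns
```

For example, calling `summarize_metadata("src/dataset_info/SpaVis_6M.csv")` from the repository root returns the number of metadata rows together with the column names found in the header.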

<img src="docs/Dataset1.png" width="100%" align="center" />
<p> </p>
<img src="docs/Dataset2.png" width="100%" align="center" />

## STAMP: **S**patial **T**ranscriptomics-**A**ugmented **M**ultimodal **P**athology Representation Learning

### Step 1: Tokenize Gene Data (Without Spatial Information)

```bash
python src/tokenize_gene_wo_spatial.py
```

### Step 2: Train Gene Encoder (Gene Encoder Pretraining - Phase One)

```bash
python src/train_gene_encoder.py gpuid_1 gpuid_2 gpuid_3 gpuid_4
```
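The `gpuid_*` arguments are placeholders. Assuming they stand for integer CUDA device indices (an assumption, since the README does not state the expected format), the command could be assembled programmatically as sketched below. `build_train_command` is a hypothetical helper, not something shipped with the repository.

```python
import sys


def build_train_command(script, gpu_ids):
    """Build an argv list: [python executable, script, "0", "1", ...].

    Assumes GPU ids are non-negative integer device indices (hypothetical).
    """
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    if any(not isinstance(g, int) or g < 0 for g in gpu_ids):
        raise ValueError("GPU ids must be non-negative integers")
    # Each id becomes its own positional argument, mirroring the command above.
    return [sys.executable, script] + [str(g) for g in gpu_ids]
```

The resulting list can be passed to `subprocess.run(..., check=True)`, for instance with `build_train_command("src/train_gene_encoder.py", [0, 1, 2, 3])` to request four GPUs.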

### Step 3: Tokenize Gene Data (With Spatial Information - Spatial-aware Sampling)

```bash
python src/tokenize_gene_w_spatial.py
```

### Step 4: Train Spatial-aware Gene Encoder (Gene Encoder Pretraining - Phase Two)

This step reuses the training script from Step 2. Before running it, adjust the parameters in [_config_train_gene_encoder.py](./src/config_files/_config_train_gene_encoder.py).

```bash
python src/train_gene_encoder.py gpuid_1 gpuid_2 gpuid_3 gpuid_4
```

### Step 5: Tokenize Multimodal Data

```bash
python src/tokenize_multi.py
```

### Step 6: Train STAMP

```bash
python src/train_stamp.py gpuid_1 gpuid_2 gpuid_3 gpuid_4
```
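The abstract describes aligning pathology images with spatial transcriptomics through contrastive learning. As a rough illustration of the basic idea only, the sketch below computes a single-scale, CLIP-style symmetric contrastive loss over paired image and gene embeddings in plain Python. It is a deliberate simplification and not the hierarchical multi-scale objective implemented in `train_stamp.py`; the function names and the temperature of 0.07 are assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, gene_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair."""
    img = [l2_normalize(v) for v in image_emb]
    gene = [l2_normalize(v) for v in gene_emb]
    # Cosine similarity between every image and every gene embedding.
    logits = [
        [sum(a * b for a, b in zip(i_vec, g_vec)) / temperature for g_vec in gene]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-gene and gene-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

With correctly paired embeddings the loss approaches zero, while shuffled pairs produce a large loss, which is the signal that pulls matched image and gene representations together during training.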

### Step 7: Tokenize Downstream Data

```bash
python src/tokenize_downstream.py
```

### Step 8: Finetune STAMP on Downstream Datasets

This step is only needed for the gene expression prediction task, which is performed through contrastive learning.

```bash
python src/finetune_stamp.py --project PSC/HHK/HER2+
```
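One common way to turn a contrastively aligned embedding space into an expression predictor is retrieval: embed the query image patch, find the most similar reference gene embeddings, and average their expression profiles. The sketch below illustrates that idea under stated assumptions. It is not confirmed to be what `finetune_stamp.py` does, and `predict_expression`, its arguments, and the default `k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_expression(query_emb, ref_gene_embs, ref_expressions, k=2):
    """Average the expression profiles of the k references closest to the query.

    Hypothetical retrieval-based predictor for illustration only.
    """
    if not ref_gene_embs or len(ref_gene_embs) != len(ref_expressions):
        raise ValueError("references must be non-empty and aligned with expressions")
    scores = [cosine(query_emb, emb) for emb in ref_gene_embs]
    # Indices of the k most similar reference embeddings.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    n_genes = len(ref_expressions[0])
    return [sum(ref_expressions[i][g] for i in top) / len(top) for g in range(n_genes)]
```

In this formulation the quality of the prediction depends directly on how well the image and gene embeddings are aligned, which is why a contrastive finetuning step on the target dataset can help.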
