SpecFormer: Normalization-Robust Transcriptomic Representations for Multi-Modal Foundation Models

08 May 2026 (modified: 28 May 2026) | Submitted to ICML 2026 FM4LS Workshop | Everyone | CC BY 4.0
Keywords: foundation models, bulk RNA-seq, transcriptomic representation learning, self-supervised learning, masked language modeling
TL;DR: We introduce TF-IDF gene ordering and masked gene identity prediction as a normalization-independent pretraining strategy for whole-transcriptome bulk RNA-seq foundation models.
Abstract: Bulk RNA sequencing remains central to translational genomics, yet self-supervised foundation models for bulk data have lagged behind single-cell approaches. Existing bulk transformer models couple representation learning to expression magnitudes through discretization or reconstruction objectives, limiting portability across normalization schemes and cohorts. We introduce SpecFormer, a self-supervised framework that converts each unordered expression profile into a sample-specific gene sequence using term frequency-inverse document frequency (TF-IDF) ordering, then pretrains a transformer encoder via masked gene identity prediction rather than expression-value reconstruction. Pretrained on harmonized TCGA Pan-Cancer data spanning five normalization schemes, SpecFormer achieves 90.83% accuracy and macro AUC-ROC of 0.997 across 33 cancer types, captures pathway co-regulation structure with mean Pearson correlations of 0.754 and 0.762 across 1,387 PARADIGM pathways, and preserves tissue-level transcriptomic organization on independent GTEx healthy tissue data without retraining. Compared with BulkRNABert, SpecFormer produces markedly richer embedding geometry (effective rank 95.6 vs. 6.3) and more stable histological subtype discrimination, without requiring expression discretization or in-distribution pretraining exposure. By decoupling representations from expression scale and normalization, SpecFormer provides a portable transcriptomic backbone suited for integration into multi-modal foundation models that jointly reason over heterogeneous omics, clinical, and imaging data.
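Illustrative sketch (not the authors' implementation): the abstract does not give the exact TF-IDF formula, sequence length, or masking rate, so the snippet below makes assumptions throughout. It assumes TF is a gene's share of a sample's total expression, IDF is a smoothed log inverse of the fraction of samples expressing the gene, and masking follows a BERT-style 15% rate with -100 as the ignored-label value. The function names `tfidf_gene_order` and `mask_gene_tokens` are hypothetical.

```python
import numpy as np


def tfidf_gene_order(expr, top_k=None):
    """Rank genes per sample by a TF-IDF style score (assumed formulation).

    expr: (n_samples, n_genes) array of non-negative expression values.
    Returns an integer array of gene indices, highest score first.
    """
    expr = np.asarray(expr, dtype=float)
    n_samples, _ = expr.shape
    # TF (assumption): each gene's share of the sample's total expression.
    totals = expr.sum(axis=1, keepdims=True)
    tf = expr / np.where(totals > 0, totals, 1.0)
    # IDF (assumption): smoothed log inverse of the number of samples
    # in which the gene is expressed at all.
    df = (expr > 0).sum(axis=0)
    idf = np.log((1 + n_samples) / (1 + df)) + 1.0
    scores = tf * idf
    # Stable sort so tied genes keep their original index order.
    order = np.argsort(-scores, axis=1, kind="stable")
    return order if top_k is None else order[:, :top_k]


def mask_gene_tokens(tokens, mask_id, mask_prob=0.15, rng=None):
    """Replace a random subset of gene tokens with mask_id.

    Returns (inputs, labels); labels hold the original gene index at
    masked positions and -100 elsewhere (an assumed ignore value).
    """
    rng = np.random.default_rng() if rng is None else rng
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < mask_prob
    inputs = np.where(mask, mask_id, tokens)
    labels = np.where(mask, tokens, -100)
    return inputs, labels
```

Under these assumptions, the model sees only gene identities in rank order, so multiplying a sample by any positive constant leaves its sequence unchanged. Robustness to non-linear normalizations would depend on how well they preserve within-sample rank, which this sketch does not test.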
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 60