SpecFormer: Normalization-Robust Transcriptomic Representations for Multi-Modal Foundation Models
Keywords: foundation models, bulk RNA-seq, transcriptomic representation learning, self-supervised learning, masked language modeling
TL;DR: We introduce TF-IDF gene ordering and masked gene identity prediction as a normalization-independent pretraining strategy for whole-transcriptome bulk RNA-seq foundation models.
Abstract: Bulk RNA sequencing remains central to translational genomics, yet
self-supervised foundation models for bulk data have lagged behind
single-cell approaches. Existing bulk transformer models couple
representation learning to expression magnitudes through discretization
or reconstruction objectives, limiting portability across normalization
schemes and cohorts. We introduce \textbf{SpecFormer}, a self-supervised
framework that converts each unordered expression profile into a
sample-specific gene sequence using term frequency--inverse document
frequency (TF-IDF) ordering, then pretrains a transformer encoder via
masked gene identity prediction rather than expression-value
reconstruction. Pretrained on harmonized TCGA Pan-Cancer data spanning
five normalization schemes, SpecFormer achieves 90.83\% accuracy and a
macro AUC-ROC of 0.997 across 33 cancer types, captures pathway
co-regulation structure with mean Pearson correlations of 0.754 and
0.762 across 1,387 PARADIGM pathways, and preserves tissue-level
transcriptomic organization on independent GTEx healthy tissue data
without retraining. Compared with BulkRNABert, SpecFormer produces
markedly richer embedding geometry (effective rank 95.6 vs. 6.3) and
more stable histological subtype discrimination, without requiring
expression discretization or in-distribution pretraining exposure. By decoupling representations from expression scale and normalization, SpecFormer provides a portable transcriptomic backbone suited for integration into multi-modal foundation models that jointly reason over heterogeneous omics, clinical, and imaging data.
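As an illustration of the two ideas named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a standard smoothed TF-IDF weighting (term frequency as each gene's share of a sample's total expression, inverse document frequency from the fraction of samples in which the gene is detected) and BERT-style random masking. The function names `tfidf_gene_order` and `mask_gene_tokens`, the `top_k` and `mask_prob` parameters, and the `-100` ignore label are hypothetical choices made for this example; the abstract does not specify any of them.

```python
import numpy as np


def tfidf_gene_order(expr, top_k=None):
    """Rank genes per sample by TF-IDF score (hypothetical helper).

    expr: array of shape (n_samples, n_genes) with non-negative expression.
    Returns an integer array of gene indices, highest score first.
    """
    expr = np.asarray(expr, dtype=float)
    n_samples = expr.shape[0]

    # TF: each gene's share of the sample's total expression.
    totals = expr.sum(axis=1, keepdims=True)
    tf = expr / np.where(totals == 0, 1.0, totals)

    # IDF: smoothed log inverse of the number of samples expressing the gene.
    df = (expr > 0).sum(axis=0)
    idf = np.log((1 + n_samples) / (1 + df)) + 1.0

    scores = tf * idf
    # Stable sort so tied genes keep their original index order.
    order = np.argsort(-scores, axis=1, kind="stable")
    return order if top_k is None else order[:, :top_k]


def mask_gene_tokens(tokens, mask_id, mask_prob=0.15, rng=None):
    """Replace a random subset of gene tokens with mask_id (hypothetical helper).

    Returns (inputs, labels); labels hold the original gene identity at
    masked positions and -100 elsewhere (assumed "ignore" convention).
    """
    rng = np.random.default_rng() if rng is None else rng
    tokens = np.asarray(tokens)
    is_masked = rng.random(tokens.shape) < mask_prob
    inputs = np.where(is_masked, mask_id, tokens)
    labels = np.where(is_masked, tokens, -100)
    return inputs, labels


# Usage: three samples, four genes; the mask token id is n_genes,
# so it cannot collide with a real gene index.
expr = np.array([[10, 0, 5, 1], [0, 8, 2, 2], [3, 4, 0, 9]], dtype=float)
sequences = tfidf_gene_order(expr)
inputs, labels = mask_gene_tokens(sequences, mask_id=expr.shape[1],
                                  rng=np.random.default_rng(0))
```

Under these assumptions, multiplying a sample by any positive constant (for example a library-size scaling factor) leaves its gene ordering unchanged, because term frequency is a ratio within the sample and the document-frequency term depends only on which genes are detected. The prediction target is a gene index, so expression values never enter the loss.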
Submission Number: 60