Self-Supervised Contextual Representation Learning for Transcriptomic Generative AI

Published: 28 May 2026, Last Modified: 28 May 2026
Venue: GenBio 2026 Poster
License: CC BY 4.0
Keywords: foundation models, bulk RNA-seq, transcriptomic representation learning, self-supervised learning, transfer learning, generative AI
TL;DR: A self-supervised bulk RNA-seq encoder that learns from gene co-occurrence rather than expression magnitudes, providing normalization-robust representations as a reusable backbone for generative and agentic biological AI systems.
Abstract: Bulk RNA sequencing remains central to translational genomics, yet self-supervised foundation models for bulk data have lagged behind single-cell approaches. Existing bulk transformer models couple representation learning to expression magnitudes through discretization or reconstruction objectives, limiting portability across normalization schemes and cohorts. We introduce SpecFormer, a self-supervised framework that converts each unordered expression profile into a sample-specific gene sequence using term frequency-inverse document frequency (TF-IDF) ordering, then pretrains a transformer encoder via masked gene identity prediction rather than expression-value reconstruction. Pretrained on harmonized TCGA Pan-Cancer data spanning five normalization schemes, SpecFormer achieves 90.83% accuracy and macro AUC-ROC of 0.997 across 33 cancer types, captures pathway co-regulation structure with mean Pearson correlations of 0.754 and 0.762 across 1,387 PARADIGM pathways, and preserves tissue-level transcriptomic organization on independent GTEx healthy tissue data without retraining. Compared with BulkRNABert, SpecFormer produces markedly richer embedding geometry (effective rank 95.6 vs. 6.3) and more stable histological subtype discrimination, without requiring expression discretization or in-distribution pretraining exposure.
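The two preprocessing steps named in the abstract, TF-IDF ordering of genes and masking of gene identities, can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not give the exact TF-IDF formula, sequence length, or masking rate, so the smoothed IDF, the `top_k` truncation, the 15% mask probability, the `-100` ignore label, and the function names `tfidf_gene_order` and `mask_gene_tokens` are all assumptions.

```python
import numpy as np

def tfidf_gene_order(expr, top_k=None):
    """Rank genes per sample by a TF-IDF score (assumed formulation).

    expr: (n_samples, n_genes) array of non-negative expression values.
    Returns gene indices per sample, highest score first.
    """
    expr = np.asarray(expr, dtype=float)
    n_samples, _ = expr.shape
    # TF: each gene's share of the sample's total expression.
    totals = expr.sum(axis=1, keepdims=True)
    tf = expr / np.where(totals > 0, totals, 1.0)
    # IDF (smoothed): genes detected in fewer samples get more weight.
    df = (expr > 0).sum(axis=0)
    idf = np.log((1 + n_samples) / (1 + df)) + 1.0
    scores = tf * idf
    # Stable sort so ties keep the original gene order.
    order = np.argsort(-scores, axis=1, kind="stable")
    return order[:, :top_k] if top_k else order

def mask_gene_tokens(seq, mask_id, mask_prob=0.15, rng=None):
    """Replace a random subset of gene tokens with mask_id.

    Labels hold the original gene index at masked positions and -100
    elsewhere (an assumed "ignore" value for the loss).
    """
    rng = np.random.default_rng() if rng is None else rng
    seq = np.asarray(seq)
    mask = rng.random(seq.shape) < mask_prob
    inputs = np.where(mask, mask_id, seq)
    labels = np.where(mask, seq, -100)
    return inputs, labels

# Toy cohort: 3 samples x 3 genes (made-up values, for illustration only).
toy = np.array([[10.0, 0.0, 5.0],
                [1.0, 1.0, 0.0],
                [2.0, 0.0, 0.0]])
toy_order = tfidf_gene_order(toy)
toy_inputs, toy_labels = mask_gene_tokens(
    toy_order, mask_id=3, mask_prob=0.5, rng=np.random.default_rng(0)
)
```

Under this formulation, multiplying a sample by any positive constant leaves its gene order unchanged, which is one way an ordering-based input can be less sensitive to per-sample scaling than raw expression magnitudes. Other normalization schemes (for example, log transforms) are not covered by this simple argument.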
Submission Number: 126