Large-Scale Benchmarking of Gene and Expression Encoding Strategies for Single-Cell Foundation Models

Published: 02 Mar 2026, Last Modified: 10 Mar 2026 · Gen² 2026 Poster · CC BY 4.0
Track: Full / long paper (5-8 pages)
Keywords: Single-cell foundation models, Gene and expression encoding, Representation learning, Benchmarking
TL;DR: We benchmark gene and expression encoding strategies for single-cell transformers at scale, finding that learned gene embeddings with soft binning substantially outperform ESM-2 embeddings and hard binning across all evaluation metrics.
Abstract: Single-cell foundation models face a critical design choice: how to encode gene identities and expression values for transformer architectures. We present a large-scale systematic evaluation comparing learned gene embeddings against embeddings derived from a pretrained protein language model (ESM-2), together with four expression encoding strategies: discrete (hard) binning, soft binning, logarithmic binning, and continuous encoding. We train 8 model configurations from scratch on 10 million cells spanning 100 diverse datasets and evaluate batch correction, biological preservation, classification, and reconstruction across 26 tissue datasets. Our results demonstrate that (1) task-specific learned gene embeddings substantially outperform ESM-2 embeddings across all metrics, with an average 29\% relative improvement; (2) soft binning paired with learned embeddings performs best, reaching 91.6\% classification accuracy and an 86.0\% macro F1 score; and (3) hard binning consistently degrades performance across all evaluation metrics. The best configuration, learned embeddings with soft binning, outperforms all alternatives and provides clear guidance for single-cell model design.
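The abstract contrasts hard binning, which maps each expression value to a single discrete bin, with soft binning, which spreads each value across neighboring bins so the token embedding becomes a weighted mixture of bin embeddings. The sketch below is a minimal illustration of that distinction, not the paper's implementation: it assumes log1p-normalized counts, evenly spaced bin centers, and a softmax over negative squared distances to the centers; all names, the temperature parameter, and the embedding table are hypothetical.

```python
import numpy as np

def hard_bin(x, edges):
    """Hard binning: each expression value maps to exactly one bin index."""
    return np.digitize(x, edges)

def soft_bin_weights(x, centers, temperature=1.0):
    """Soft binning: each value gets a softmax weight over all bins,
    based on its squared distance to each bin center (assumed scheme)."""
    d = -((x[:, None] - centers[None, :]) ** 2) / temperature
    w = np.exp(d - d.max(axis=1, keepdims=True))  # stable softmax
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical usage: log1p-normalized expression values, 10 bins.
x = np.log1p(np.array([0.0, 1.2, 3.7, 8.0]))
centers = np.linspace(0.0, np.log1p(10.0), 10)
edges = (centers[:-1] + centers[1:]) / 2        # boundaries midway between centers

print(hard_bin(x, edges))                        # one integer bin per value
W = soft_bin_weights(x, centers)                 # (4, 10) distribution over bins

# Soft expression token = weighted mixture of learned bin embeddings.
rng = np.random.default_rng(0)
E = rng.standard_normal((10, 64))                # hypothetical bin embedding table
soft_tokens = W @ E                              # (4, 64) soft expression embeddings
```

Under this reading, hard binning discards within-bin ordering (a value just inside a bin boundary and one at its center receive the same embedding), whereas soft binning preserves a graded signal, which is consistent with the reported gap between the two strategies.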
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 21