Track: Tiny / short paper (2-4 pages)
Keywords: Single-Cell, Foundation Model, Mechanistic Interpretability, Steering, scFM, Interpretability
TL;DR: SAEs decompose a single-cell foundation model into interpretable biological features that can be steered to causally reprogram cell identity in silico.
Abstract: Single-cell Foundation Models (scFMs) have demonstrated a remarkable capacity for learning cellular representations, yet their internal mechanisms remain largely opaque. In this work, we apply Sparse Autoencoders (SAEs) to the residual stream of AIDO.Cell, a transformer-based scFM, to decompose its latent space into interpretable biological features. As a proof of concept, we train TopK SAEs on the 12th transformer layer using the PBMC3K dataset. Using Gene Ontology enrichment to interpret features, we find that ∼64% of the trained SAE features receive statistically significant biological annotations, compared with ∼34% for the dense raw activations. Beyond interpretation, we demonstrate that these features can be used to functionally “steer” cell identity: amplifying and suppressing individual features (e.g., Viral Defense) drives symmetric changes in gene expression, and the steered cell states align with the expected biological programs. Furthermore, we implement a contrastive steering method that automatically discovers sparse feature combinations driving CD4+ T cells towards a CD8+ T cell phenotype. Inspection of the selected features reveals that the model has learned biologically relevant directions in latent space that enable cell-type steering. Our findings show that an scFM can learn a decomposable and manipulable model of cell biology, enabling interpretable in silico experiments.
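The abstract describes three mechanisms: a TopK SAE over residual-stream activations, single-feature steering (amplify or suppress one latent), and contrastive selection of a sparse feature set. The following is a minimal NumPy sketch of how these pieces could fit together, under stated assumptions. It is not the authors' implementation. `TopKSAE`, `steer`, `contrastive_features`, and `contrastive_steer` are hypothetical names. The weights are random rather than trained, and the dimensions are illustrative, not those of AIDO.Cell. The selection rule (largest mean-activation difference between target and source cells) is one plausible contrastive criterion that we assume here. Steering keeps the SAE reconstruction error so that only the edited feature's contribution changes. In a real pipeline, the edited activations would be passed through the remaining AIDO.Cell layers to read out gene expression, which is not shown.

```python
import numpy as np


class TopKSAE:
    """Hypothetical TopK sparse autoencoder (forward pass only, random weights)."""

    def __init__(self, d_model: int, n_features: int, k: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.k = k
        self.W_enc = rng.normal(0.0, 1.0 / np.sqrt(d_model), (d_model, n_features))
        self.b_enc = np.zeros(n_features)
        self.W_dec = self.W_enc.T.copy()  # tied initialisation; trained SAEs usually untie
        self.b_dec = np.zeros(d_model)

    def encode(self, x: np.ndarray) -> np.ndarray:
        # Pre-activations, then keep only the k largest entries per cell (TopK) and ReLU them.
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        idx = np.argpartition(pre, -self.k, axis=-1)[..., -self.k:]
        z = np.zeros_like(pre)
        np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
        return np.maximum(z, 0.0)

    def decode(self, z: np.ndarray) -> np.ndarray:
        return z @ self.W_dec + self.b_dec


def steer(sae: TopKSAE, x: np.ndarray, feature: int, value: float) -> np.ndarray:
    """Set one feature to `value` (large = amplify, 0 = suppress) in every cell.

    The reconstruction error x - decode(encode(x)) is preserved, so the edit equals
    (value - z[:, feature]) times that feature's decoder direction.
    """
    z = sae.encode(x)
    z_steered = z.copy()
    z_steered[..., feature] = value
    return x + sae.decode(z_steered) - sae.decode(z)


def contrastive_features(sae: TopKSAE, x_source: np.ndarray, x_target: np.ndarray, n_select: int):
    """Assumed rule: pick the features whose mean activation differs most between groups."""
    delta = sae.encode(x_target).mean(axis=0) - sae.encode(x_source).mean(axis=0)
    chosen = np.argsort(-np.abs(delta))[:n_select]
    return chosen, delta[chosen]


def contrastive_steer(sae: TopKSAE, x_source: np.ndarray, chosen: np.ndarray,
                      delta: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    # Shift every source cell along the selected decoder directions, weighted by delta.
    return x_source + alpha * (delta @ sae.W_dec[chosen])


# Usage with stand-in activations (random, in place of real layer-12 states).
rng = np.random.default_rng(1)
sae = TopKSAE(d_model=64, n_features=256, k=8)
x_cd4 = rng.normal(size=(20, 64))            # hypothetical "source" cells
x_cd8 = rng.normal(loc=0.3, size=(20, 64))   # hypothetical "target" cells
x_amplified = steer(sae, x_cd4, feature=5, value=10.0)
x_suppressed = steer(sae, x_cd4, feature=5, value=0.0)
chosen, delta = contrastive_features(sae, x_cd4, x_cd8, n_select=4)
x_shifted = contrastive_steer(sae, x_cd4, chosen, delta)
```

Keeping the reconstruction error means that, with the same feature value, `steer` returns the input unchanged. Any downstream effect can therefore be attributed to the single edited direction, which is what makes symmetric amplify/suppress comparisons meaningful.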
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 38