Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Track: long paper (4–8 pages excluding references)
Keywords: geometric analysis, local intrinsic dimensionality, foundation models, representation geometry, zero-shot prediction, manifold structure
TL;DR: The manifold knows what the prediction head forgets. Pathogenic variants leave low-dimensional spots in embedding space. One model learns this geometry internally, then destroys it before output.
Abstract: Log-likelihood ratios (LLR) have emerged as a standard probe of biological foundation models’ ability to predict variant effects. However, it remains unclear whether the latent manifold of these models already encodes the relevant geometric structure of variant effects. We investigate this question across the central dogma by computing local intrinsic dimensionality (LID) around reference and alternative variant embeddings generated by EVO2 (DNA), ORTHRUS (RNA), and ESM3 (protein). This allows us to compare the geometric neighbourhoods of pathogenic and benign variants in each model’s embedding space. On 105,224 ClinVar missense variants, we find that the optimal scoring method is modality-dependent: ESM3 protein achieves the highest LID-based AUROC (0.738), exceeding its own LLR (0.629), while EVO2-7B DNA achieves the highest LLR (0.878). Dense per-layer analysis reveals three distinct information-processing regimes: stable geometric signal from the first layer onward (ESM3), monotonic buildup (ORTHRUS), and cyclic build-and-flush phases tied to the convolution–attention architecture (EVO2). The geometric signal persists after controlling for evolutionary conservation (57–61% retained) and is positive across all conservation quartiles, indicating that foundation models learn constraint structure beyond what conservation scores capture.
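The abstract does not say which LID estimator is used. As a minimal sketch of what "computing LID around an embedding" can look like, the Python below assumes the common maximum-likelihood (Hill-type) estimator over the k nearest neighbours. The function name `lid_mle`, the choice `k=50`, and the synthetic data are illustrative assumptions and are not taken from the submission.

```python
import numpy as np


def lid_mle(query, reference, k=50):
    """Maximum-likelihood LID estimate at `query` (assumed estimator).

    query:     (d,) embedding vector.
    reference: (n, d) array of embeddings used as the neighbourhood pool.
    k:         number of nearest neighbours (hypothetical default).
    """
    dists = np.linalg.norm(reference - query, axis=1)
    # Drop exact zeros (e.g. the query itself) and keep the k smallest distances.
    dists = np.sort(dists[dists > 0])[:k]
    r_k = dists[-1]  # distance to the k-th nearest neighbour
    # LID = -1 / mean(log(r_i / r_k)); the last term contributes log(1) = 0.
    return -1.0 / np.mean(np.log(dists / r_k))


# Synthetic check: points on a 2-D plane inside a 10-D space should give a
# lower estimate than points that fill all 10 dimensions.
rng = np.random.default_rng(0)
flat = np.zeros((2000, 10))
flat[:, :2] = rng.uniform(-1.0, 1.0, size=(2000, 2))
full = rng.normal(size=(2000, 10))

origin = np.zeros(10)
lid_flat = lid_mle(origin, flat)
lid_full = lid_mle(origin, full)
```

In a setting like the one described, `query` would be a reference or alternative variant embedding and `reference` a pool of embeddings from the same model and layer. How the two estimates are combined into a per-variant score is not specified in the abstract and is left open here.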
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 90