Causal State Variables in V-JEPA 2 Latents: Discovery, Intervention, and Portability

Published: 11 Jun 2026, Last Modified: 11 Jun 2026 | Mech Interp Workshop ICML 2026 Virtual poster | Readers: Everyone | License: CC BY 4.0
Keywords: Concept Discovery (e.g., SAEs, dictionary learning), Methods (probing, steering, causal interventions), Feature Geometry
Other Keywords: video models, V-JEPA, causal degeneracy, representation portability, subspace extraction, self-supervised learning
TL;DR: We apply a causal intervention pipeline to V-JEPA 2, revealing early-layer causal encoding of motion and discovering extreme "causal degeneracy" where geometrically distinct subspaces hold equivalent functional roles.
Abstract: Video world models trained with Joint Embedding Predictive Architectures (JEPAs) achieve strong performance on motion understanding benchmarks, but whether their latent representations encode causally functional state variables remains unknown. We apply a three-stage causal-state discovery pipeline (combining L1-regularized probing, class-conditional PCA, difference-in-means subspace extraction, and three families of causal interventions with four matched controls) to the frozen encoder of V-JEPA 2 ViT-L (326M parameters, d=1024, 24 layers) on a synthetic controlled-sequence dataset of 400 clips across 8 motion directions. V-JEPA 2 encodes motion direction from remarkably early layers (96% dense-probe accuracy at layer 4; 100% by layer 7), using a distributed subspace occupying 57% of latent dimensions. Causal ablation at layer 7 produces effects 43× larger than random-direction controls, confirming the identified subspace is causally privileged. The SAS–RCE dissociation, in which moderate subspace alignment (SAS=0.35) coexists with near-perfect retained causal effect (RCE=0.99), reveals that causal structure is far more stable than its geometric embedding. Findings generalize to complex synthetic stimuli, real Kinetics video (5.3× CE ratio), and V-JEPA 2 ViT-H (54× CE ratio with near-perfect cross-architecture CCA alignment). These results provide the first intervention-based evidence that JEPA video models encode motion as a causally functional latent variable, and introduce SAS and RCE as portability metrics for mechanistic interpretability.
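For readers unfamiliar with the terms, difference-in-means subspace extraction and ablation against a random-direction control can be sketched roughly as follows. This is an illustrative toy on random data, not the authors' implementation: the function names, the linear readout, the 64-dimensional latents, and the effect measure are all assumptions made for the example, and the abstract does not specify how CE, SAS, or RCE are computed.

```python
import numpy as np

def diff_in_means_subspace(X, y, tol=1e-8):
    """Orthonormal basis (d, k) for the span of (class mean - grand mean) vectors."""
    grand = X.mean(axis=0)
    D = np.stack([X[y == c].mean(axis=0) - grand for c in np.unique(y)])
    # Right singular vectors span the row space of D; drop near-zero directions.
    _, s, Vt = np.linalg.svd(D, full_matrices=False)
    k = int((s > tol * s[0]).sum())
    return Vt[:k].T

def ablate(X, B):
    """Remove the component of each row of X lying in span(B), B orthonormal."""
    return X - (X @ B) @ B.T

def random_subspace(d, k, rng):
    """Random k-dimensional orthonormal basis, used as a matched control."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return Q

def causal_effect(X, X_ablated, readout):
    """Assumed effect measure: mean absolute change in a downstream readout."""
    return float(np.mean(np.abs(readout(X) - readout(X_ablated))))

# Toy stand-in for latents: 8 classes x 50 samples (400 total), d = 64.
rng = np.random.default_rng(0)
n_classes, per_class, d = 8, 50, 64
means = 3.0 * rng.standard_normal((n_classes, d))
y = np.repeat(np.arange(n_classes), per_class)
X = means[y] + rng.standard_normal((y.size, d))

B = diff_in_means_subspace(X, y)

# Hypothetical linear readout: projections onto centered class means.
W = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)]) - X.mean(axis=0)
readout = lambda Z: Z @ W.T

ce_target = causal_effect(X, ablate(X, B), readout)
ce_random = float(np.mean([
    causal_effect(X, ablate(X, random_subspace(d, B.shape[1], rng)), readout)
    for _ in range(20)
]))
print("CE ratio (target / random):", ce_target / ce_random)
```

In this toy the ratio exceeds 1 by construction, because the readout lives entirely in the extracted subspace; on a real encoder the downstream readout would be the model's own later layers or a trained probe, and the ratio is an empirical question.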
Submission Number: 11