Track: long paper (up to 8 pages)
Keywords: VLA models, mechanistic interpretability, sparse autoencoders, multimodal fusion, activation steering, robot manipulation
TL;DR: On three VLA architectures, we show that fine-tuned vision-language-action models are effectively vision-only policies with vestigial language pathways, and that per-token sparse autoencoders (not mean-pooled) enable causal feature identification.
Abstract: Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We present the first cross-architecture mechanistic study of visuomotor policies, applying activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M to 7B parameters across 394,000+ rollout episodes on four benchmarks. Three findings emerge as architecture-independent properties of behavior-cloned policies.
First, the visual pathway dominates action generation. Injecting baseline activations into null-prompt episodes recovers near-identical behavior across architectures, while cross-task injection universally degrades destination task success. Critically, trajectory analysis reveals that this ``failure'' masks successful behavioral transfer: robots execute source-task motor programs in the wrong scene (99.8\% of X-VLA episodes align with the source trajectory), exposing spatially bound action sequences tied to scene coordinates rather than abstract task representations. Second, language sensitivity depends on task structure, not model design. When the visual scene uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (e.g., X-VLA libero\_goal: 94\% $\to$ 10\% under wrong prompts vs. libero\_object: 60--100\% regardless). Third, in all three multi-pathway architectures ($\pi$0.5, SmolVLA, GR00T), expert pathways consistently encode motor programs while VLM pathways encode goal semantics, with expert injection causing $2\times$ greater behavioral displacement.
Methodologically, we train 424 SAEs and find that per-token processing is generally essential, though architecture modulates this relationship in unexpected ways. Contrastive feature identification recovers 82+ manipulation concepts, and causal ablation reveals that sensitivity (28--92\% zero-effect rates) does not follow representation width. We release Action Atlas (https://action-atlas.com) for interactive exploration of VLA representations across all six models.
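The distinction between per-token and mean-pooled SAE processing can be illustrated with a minimal sketch. This is not the authors' implementation: the ReLU encoder form, the function names (`sae_encode`, `per_token_features`, `mean_pooled_features`), and the toy weights are all assumptions chosen for illustration. It shows how a feature that fires on a single token can vanish once activations are averaged over the sequence before encoding.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Assumed ReLU SAE encoder: features = relu(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def per_token_features(acts, W_enc, b_enc):
    """Encode every token separately: (num_tokens, d_model) -> (num_tokens, d_sae)."""
    return sae_encode(acts, W_enc, b_enc)

def mean_pooled_features(acts, W_enc, b_enc):
    """Average over tokens first, then encode: (num_tokens, d_model) -> (d_sae,)."""
    return sae_encode(acts.mean(axis=0), W_enc, b_enc)

# Toy setup (hypothetical values): 4 tokens, 2-dimensional activations.
# Only token 0 carries signal along dimension 0.
acts = np.array([[1.0, 0.0],
                 [0.0, 0.0],
                 [0.0, 0.0],
                 [0.0, 0.0]])
W_enc = np.eye(2)                 # identity encoder weights for readability
b_enc = np.array([-0.5, -0.5])    # negative bias acts as a firing threshold

tok = per_token_features(acts, W_enc, b_enc)       # feature 0 fires on token 0 only
pooled = mean_pooled_features(acts, W_enc, b_enc)  # mean 0.25 falls below the threshold

print("per-token feature 0:", tok[:, 0])
print("mean-pooled feature 0:", pooled[0])
```

In this toy case the per-token encoding keeps the token-0 activation, while pooling dilutes it below the threshold, which is one plausible reason token-level features would be easier to identify and ablate causally.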
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 75