Keywords: Steering, Sparse Autoencoders, Applications of interpretability, Vision transformers, Causal interventions
TL;DR: We steer the language backbone of a Vision-Language-Action model in a sparse autoencoder's latent space, achieving fine-grained control of robotic policies in the robot action space.
Abstract: Vision–language–action (VLA) agents combine perception, language, and control to perform general-purpose tasks, but their internal decision-making is poorly understood and hard to steer. This opacity limits trust and safe deployment in robotics (i.e., embodied AI). In this work, we show that discrete robot actions can be steered by identifying a small number of meaningful features inside the residual stream of a VLA policy. Using a Magma-style model with a ConvNeXt vision encoder and a LLaMA-3-8B-Instruct decoder in the SimplerEnv simulator, we learn behavior directions from contrastive pairs of inputs that differ only in the target action (e.g., open vs. close gripper). Specifically, we use a sparse autoencoder (SAE) fitted to the decoder’s residual stream to construct steering vectors in latent space, which are then decoded back and applied at inference time. This intervention reliably shifts the model’s action choice while preserving overall coherence. Our analysis shows that steering is effective but not perfectly disentangled, as the intervention inadvertently activates related features. These results provide the first evidence that latent-space techniques can steer embodied multimodal policies without retraining. More broadly, this work highlights that mechanistic interpretability techniques (e.g., SAEs) can provide handles to control action-level behavior of complex agents.
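The sketch below illustrates the general recipe described in the abstract: encode residual-stream activations with a trained SAE, build a latent steering direction from contrastive activation pairs, and add the decoded direction back into the residual stream at inference via a forward hook. It is a minimal illustration under assumed names (`sae` as a dict of encoder/decoder weights, `STEER_LAYER`, the hook-based injection), not the authors' implementation.

```python
# Minimal sketch of SAE latent-space steering for a decoder-only backbone.
# Assumes a trained sparse autoencoder with weights W_enc/b_enc (encoder)
# and W_dec/b_dec (decoder) over residual-stream vectors. All names here
# (sae, make_steering_hook, STEER_LAYER, alpha) are illustrative.
import torch


def sae_encode(sae, x):
    # ReLU(x @ W_enc + b_enc): sparse latent code for residual-stream vectors.
    return torch.relu(x @ sae["W_enc"] + sae["b_enc"])


def sae_decode(sae, z):
    # z @ W_dec + b_dec: map a latent code back to residual-stream space.
    return z @ sae["W_dec"] + sae["b_dec"]


def latent_steering_direction(sae, pos_acts, neg_acts):
    # Contrastive pairs differ only in the target action (e.g., open vs. close
    # gripper). The latent direction is the mean difference of their SAE codes.
    z_pos = sae_encode(sae, pos_acts).mean(dim=0)
    z_neg = sae_encode(sae, neg_acts).mean(dim=0)
    return z_pos - z_neg


def make_steering_hook(sae, z_dir, alpha=4.0):
    # Decode the (scaled) latent direction without the decoder bias, then add
    # it to the residual stream at inference time; alpha sets the strength.
    delta = sae_decode(sae, alpha * z_dir) - sae_decode(sae, torch.zeros_like(z_dir))

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + delta.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook
```

In such a setup, the hook would be registered on a chosen decoder layer (e.g., `model.model.layers[STEER_LAYER].register_forward_hook(make_steering_hook(sae, z_dir))`), with the layer index and strength `alpha` treated as hyperparameters.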
Submission Number: 157