Keywords: Few-Shot Learning, Robotics, Mechanistic Interpretability
TL;DR: Robotic Steering selectively finetunes only the attention heads responsible for a task's physical reasoning, achieving superior robotic adaptation with significantly fewer parameters than standard approaches.
Abstract: Vision-Language Action Models (VLAs) promise to extend the remarkable success of foundation models in vision and language to robotics.
Yet, unlike their vision and language counterparts, VLAs must be finetuned before deployment to contend with complex physical factors such as robot embodiment, environment characteristics, and spatial relationships.
Current finetuning methods adapt the same set of parameters regardless of the visual, linguistic, and physical characteristics of a particular task.
Inspired by functional specificity in neuroscience, we hypothesize that it is more effective to finetune the components of model representations that are specific to a given task.
In this work, we introduce Robotic Steering, a novel mechanistic finetuning approach that identifies task-specific representations in the attention-head space to selectively adapt VLAs.
In particular, we use few-shot examples to identify and selectively finetune only the VLA attention heads that align with the specific physical, visual, and linguistic requirements of a task. Through comprehensive on-robot evaluations using a Franka Emika robot arm, we demonstrate that Robotic Steering matches or outperforms full-head LoRA across all tested tasks. Crucially, Robotic Steering demonstrates superior robustness under environmental and task variations compared to standard LoRA finetuning, while enabling faster, more compute-efficient, and interpretable experimentation. Grounded in mechanistic interpretability, Robotic Steering offers a controllable, efficient, and generalizable framework for adapting VLAs to the diverse physical requirements of robot tasks.
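To make the selection-then-adaptation idea concrete, below is a minimal sketch (not the authors' implementation) of head-selective low-rank finetuning: per-head relevance scores, assumed here to come from few-shot demonstrations, pick the top-k (layer, head) pairs, and lightweight LoRA-style adapters are attached only to those heads. The names `select_heads` and `HeadLoRA`, the score tensor, and all dimensions are hypothetical placeholders.

```python
# Sketch of head-selective LoRA finetuning (illustrative only; the scoring
# procedure and adapter wiring are assumptions, not the paper's code).
import torch
import torch.nn as nn


class HeadLoRA(nn.Module):
    """Low-rank residual update applied to the output of one attention head."""

    def __init__(self, head_dim: int, rank: int = 4):
        super().__init__()
        # Standard LoRA init: A random, B zero, so the update starts at zero.
        self.A = nn.Parameter(torch.randn(head_dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, head_dim))

    def forward(self, head_out: torch.Tensor) -> torch.Tensor:
        # head_out: (..., head_dim); add the low-rank correction.
        return head_out + head_out @ self.A @ self.B


def select_heads(scores: torch.Tensor, k: int) -> list[tuple[int, int]]:
    """Return the top-k (layer, head) pairs by task-relevance score."""
    n_layers, n_heads = scores.shape
    top = torch.topk(scores.flatten(), k).indices
    return [(int(i) // n_heads, int(i) % n_heads) for i in top]


# Hypothetical example: 12 layers x 8 heads of attribution scores computed
# from a handful of task demonstrations; adapt only the 16 most relevant heads.
scores = torch.rand(12, 8)
selected = select_heads(scores, k=16)
adapters = {lh: HeadLoRA(head_dim=64) for lh in selected}
n_params = sum(p.numel() for a in adapters.values() for p in a.parameters())
print(f"Adapting {len(selected)} heads with {n_params} trainable parameters")
```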
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 9643