Don’t Blind Your VLA: Aligning Visual Representations for OOD Generalization
Keywords: VLA, Generalization, Representation Alignment, Fine-Tuning
TL;DR: Naive action fine-tuning of VLA models degrades visual representations inherited from pre-trained VLMs; we develop diagnostics to quantify retention and propose a simple alignment method that preserves these features and improves OOD generalization.
Abstract: The growing success of Vision-Language-Action (VLA) models stems from the promise that pre-trained Vision-Language Model (VLM) backbones can endow agents with transferable world knowledge and vision-language reasoning, laying a foundation for action models with broader generalization. Yet when these backbones are adapted to the action modality, it remains unclear to what extent their original VLM representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we design a set of targeted tasks and analytical methods that contrast VLA models with their underlying VLM backbones, isolating changes in visual reasoning induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the retention of visual representations and highlights practical approaches to preserve inherited vision-language capabilities.
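To make the idea of "aligning visual representations during action fine-tuning" concrete, below is a minimal sketch of one way such an alignment objective could be implemented, assuming a distillation-style auxiliary loss that keeps the fine-tuned VLA's visual features close to those of the frozen pre-trained VLM backbone. The function names, the cosine-similarity objective, and the weighting coefficient are illustrative assumptions, not the paper's stated method.

```python
import torch
import torch.nn.functional as F

def alignment_loss(vla_visual_feats: torch.Tensor,
                   vlm_visual_feats: torch.Tensor) -> torch.Tensor:
    """Penalize drift of the VLA's visual tokens away from the frozen VLM's.

    Both inputs are per-token visual features of shape (batch, tokens, dim).
    The VLM features act as a fixed teacher, so gradients are not propagated
    through them.
    """
    vla = F.normalize(vla_visual_feats, dim=-1)
    vlm = F.normalize(vlm_visual_feats.detach(), dim=-1)  # frozen teacher
    # 1 - cosine similarity, averaged over tokens and batch.
    return (1.0 - (vla * vlm).sum(dim=-1)).mean()

def total_loss(action_loss: torch.Tensor,
               vla_visual_feats: torch.Tensor,
               vlm_visual_feats: torch.Tensor,
               lam: float = 0.1) -> torch.Tensor:
    # Joint objective: action fine-tuning plus representation retention.
    return action_loss + lam * alignment_loss(vla_visual_feats, vlm_visual_feats)
```

In this hypothetical setup, the alignment term is simply added to the action-prediction loss at each fine-tuning step, trading off task adaptation against retention of the inherited visual representations via the weight `lam`.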
Area: Robotics and Control (ROBOT)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 960