Keywords: Vision Language Action Models, Generalization, Interpretability, Benchmark
TL;DR: VLAs fail to generalize, although they hold the potential to do so.
Abstract: Vision-language-action models (VLAs) often achieve high performance on demonstrated tasks but struggle significantly when required to extrapolate, i.e., to recombine skills used in different tasks in novel ways. For instance, a VLA might successfully put the cream cheese in the bowl and put the bowl on top of the cabinet, yet still fail to put the cream cheese on top of the cabinet. This motivates us to investigate whether VLAs merely overfit to demonstrated tasks or still hold the potential to extrapolate. Our study uses the text latent as its key ingredient: a task-specific vector derived from the model's hidden states. It encodes the semantics necessary for completing a task, and the associated task behavior can be reconstructed by writing it to the model's residual stream. Furthermore, we find that skills used in distinct tasks can be combined to produce novel behaviors by blending their respective text latents. Applying this to π0, we increase its success rate from 9% to 83% on the proposed libero-ood benchmark, which features 20 tasks extrapolated from standard LIBERO tasks. This reveals that the skill representations encoded in text latents are individual yet composable, whereas π0 fails to autonomously combine these representations for extrapolation. It also validates the design of libero-ood: it comprises tasks that the model fails, yet should be able to complete. We then tested other VLAs on libero-ood, and none achieved a success rate higher than 21%. Further analysis reveals that VLAs share a common pattern of spatial overfitting, associating object names with where the objects are located in the demonstrated scene rather than achieving true object and goal understanding.
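The abstract describes extracting a task-specific "text latent" from hidden states, writing it back into the residual stream, and blending latents from two tasks to compose skills. Below is a minimal, hypothetical sketch of that idea using a toy PyTorch model; it is not the authors' code, and the model, layer choice, averaging scheme, and blending weights are illustrative assumptions only.

```python
# Minimal sketch (assumptions, not the paper's implementation): a "text latent"
# is approximated here as the mean hidden state over a task's text tokens, and
# a blend of two such latents is injected into the residual stream via a hook.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyBlock(nn.Module):
    """Stand-in for one transformer block with a residual connection."""
    def __init__(self, d=16):
        super().__init__()
        self.ff = nn.Linear(d, d)
    def forward(self, x):
        return x + self.ff(x)  # residual stream update

model = nn.Sequential(ToyBlock(), ToyBlock())

def text_latent(hidden_states: torch.Tensor) -> torch.Tensor:
    # Hypothetical extraction: average cached hidden states over batch and
    # text-token positions for one demonstrated task, giving a (d,) vector.
    return hidden_states.mean(dim=(0, 1))

# Dummy cached activations for two demonstrated tasks (e.g. a "pick" task
# and a "place" task); in practice these would come from the VLA's layers.
latent_a = text_latent(torch.randn(8, 10, 16))
latent_b = text_latent(torch.randn(8, 10, 16))
blended = 0.5 * latent_a + 0.5 * latent_b  # blend skills from the two tasks

def steer(module, inputs, output):
    # Write the blended latent into the residual stream at this layer;
    # returning a tensor from a forward hook replaces the layer's output.
    return output + blended

handle = model[1].register_forward_hook(steer)
out = model(torch.randn(2, 10, 16))  # forward pass now reflects the blend
handle.remove()
```

In this sketch, the blending weight 0.5 and the injection layer are arbitrary; the paper's actual extraction and writing procedure may differ.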
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 11155