Keywords: Vision Language Action Models, Generalization, Interpretability, Benchmark
TL;DR: VLAs fail to generalize, although they hold the potential to do so.
Abstract: Vision-language-action models (VLAs) often achieve high performance on demonstrated tasks but struggle significantly when required to extrapolate, i.e., to recombine skills used in different tasks in novel ways. For instance, a VLA might successfully put the cream cheese in the bowl and put the bowl on top of the cabinet, yet still fail to put the cream cheese on top of the cabinet. This motivates us to investigate whether VLAs merely overfit to demonstrated tasks or still hold the potential to extrapolate. Our study uses the text latent as its key ingredient: a task-specific vector derived from the model's hidden states. It encodes the semantics necessary for completing a task, and the associated task behavior can be reconstructed by writing it to the model's residual stream. Furthermore, we find that skills used in distinct tasks can be combined to produce novel behaviors by blending their respective text latents. Applying this to π0, we increase its success rate from 9% to 83% on the proposed libero-ood benchmark, which features 20 tasks extrapolated from standard LIBERO tasks. This reveals that the skill representations encoded in text latents are individual yet composable, whereas π0 fails to autonomously combine these representations for extrapolation. It also validates the design of libero-ood: it comprises tasks that the model fails, yet should be able to complete. We then tested other VLAs on libero-ood, and none achieved a success rate higher than 21%. Further analysis reveals that VLAs share a common pattern of spatial overfitting, associating object names with where the objects are located in the demonstrated scene rather than achieving true object and goal understanding.
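The abstract describes extracting a task-specific "text latent" from hidden states, writing it back into the residual stream, and blending latents from two tasks to compose skills. Below is a minimal, hypothetical sketch of that idea using a toy PyTorch model; it is not the authors' code, and the model, layer choice, averaging scheme, and blending weights are illustrative assumptions only.

```python
# Minimal sketch (assumptions, not the paper's implementation): a "text latent"
# is approximated here as the mean hidden state over a task's text tokens, and
# a blend of two such latents is injected into the residual stream via a hook.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyBlock(nn.Module):
    """Stand-in for one transformer block with a residual connection."""
    def __init__(self, d=16):
        super().__init__()
        self.ff = nn.Linear(d, d)
    def forward(self, x):
        return x + self.ff(x)  # residual stream update

model = nn.Sequential(ToyBlock(), ToyBlock())

def text_latent(hidden_states: torch.Tensor) -> torch.Tensor:
    # Hypothetical extraction: average cached hidden states over batch and
    # text-token positions for one demonstrated task, giving a (d,) vector.
    return hidden_states.mean(dim=(0, 1))

# Dummy cached activations for two demonstrated tasks (e.g. a "pick" task
# and a "place" task); in practice these would come from the VLA's layers.
latent_a = text_latent(torch.randn(8, 10, 16))
latent_b = text_latent(torch.randn(8, 10, 16))
blended = 0.5 * latent_a + 0.5 * latent_b  # blend skills from the two tasks

def steer(module, inputs, output):
    # Write the blended latent into the residual stream at this layer;
    # returning a tensor from a forward hook replaces the layer's output.
    return output + blended

handle = model[1].register_forward_hook(steer)
out = model(torch.randn(2, 10, 16))  # forward pass now reflects the blend
handle.remove()
```

In this sketch, the blending weight 0.5 and the injection layer are arbitrary; the paper's actual extraction and writing procedure may differ.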
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 11155