If You Can Make an Omelette, Can You Crack an Egg? Probing Zero-Shot Subtask Generalization in Vision-Language-Action Models

Published: 01 Feb 2026, Last Modified: 01 Feb 2026 · CoRL 2025 Workshop LEAP (Early-bird) · CC BY 4.0
Keywords: vision-language-action models, multi-task learning, language annotation
Abstract: Recent robotic vision-language-action (VLA) models have shown impressive zero- and few-shot capabilities when deployed in unseen environments and on unseen robot morphologies. While natural language is a convenient way to specify tasks, it remains unclear how reliably VLAs can follow previously unseen language instructions after adaptation to new domains. This capability is particularly important in multi-task settings, where collecting data and finetuning a model for each potential task is impractical. To investigate this, we evaluate how well VLAs finetuned on a set of high-level tasks (e.g., place block in drawer, stack blocks) perform on the constituent low-level subtasks (e.g., grasp block, lift grasped block), and compare them to models finetuned directly on those subtasks. This evaluation protocol isolates unseen-instruction understanding from the model's physical task-execution capabilities. We find that even for larger VLAs, the performance gap between high-level and subtask finetuning does not shrink consistently. Overall, our results indicate that, beyond model scaling, fine-grained robot data annotation and appropriate data collection protocols are crucial for improving the multi-task capabilities of existing robotic VLA policies.
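The following is a minimal sketch (not the authors' code) of the comparison the abstract describes: roll out a policy finetuned on high-level tasks and a policy finetuned directly on subtasks on the same low-level subtask instructions, and compare their success rates. The function run_episode and the two policy objects in the usage comment are hypothetical placeholders for whatever VLA inference and simulation stack is actually used.

from typing import Callable, Iterable

# Low-level subtask instructions held out from high-level finetuning.
SUBTASK_INSTRUCTIONS = [
    "grasp the block",
    "lift the grasped block",
]

def success_rate(
    run_episode: Callable[[str], bool],   # returns True if the rollout succeeds
    instructions: Iterable[str],
    episodes_per_instruction: int = 20,
) -> float:
    """Average success over all subtask instructions and rollouts."""
    results = [
        run_episode(instruction)
        for instruction in instructions
        for _ in range(episodes_per_instruction)
    ]
    return sum(results) / len(results)

# Hypothetical usage: the gap between the two rates reflects how well the
# high-level-finetuned model follows unseen subtask instructions, separately
# from whether the subtasks are physically executable at all.
#
# gap = success_rate(subtask_finetuned_policy, SUBTASK_INSTRUCTIONS) \
#     - success_rate(highlevel_finetuned_policy, SUBTASK_INSTRUCTIONS)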
Submission Number: 16