Keywords: Vision-Language-Action Models, Skill Learning, Gradient-Free Adaptation
TL;DR: Learning a linearly composable skill space during VLA pretraining enables 1-shot domain adaptation.
Abstract: Vision-Language-Action (VLA) models trained on large robot datasets promise general-purpose, robust control across diverse domains and embodiments. However, existing approaches often fail out-of-the-box when faced with novel environments, embodiments, or tasks. We introduce Mixture of Skills VLA (MoS-VLA), a framework that represents robot manipulation policies as linear combinations of a finite set of learned basis functions. During pretraining, MoS-VLA jointly learns these basis functions across datasets from the Open X-Embodiment project, producing a structured skill space. At test time, adapting to a new task requires only a single expert demonstration: the corresponding skill representation is inferred via a lightweight convex optimization problem that minimizes the L1 action-prediction error, without any gradient updates. This gradient-free adaptation incurs minimal overhead while enabling rapid instantiation of new skills. Empirically, MoS-VLA achieves lower action-prediction error on all five unseen datasets and succeeds in both simulation and real-robot tasks where a pretrained VLA model fails outright.
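To make the test-time adaptation concrete, below is a minimal sketch of the kind of convex L1 fit the abstract describes, assuming the K skill bases' action predictions on the demonstration are already available as arrays; the function name, the use of cvxpy, and the simplex constraint on the weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def fit_skill_weights(basis_actions: np.ndarray, demo_actions: np.ndarray,
                      simplex: bool = True) -> np.ndarray:
    """Fit mixture weights over K skill bases from one expert demonstration.

    basis_actions: (K, T, D) actions predicted by each skill basis on the demo frames.
    demo_actions:  (T, D) expert actions from the single demonstration.
    Returns weights w of shape (K,).
    """
    K = basis_actions.shape[0]
    B = basis_actions.reshape(K, -1).T      # (T*D, K): one column per skill basis
    a = demo_actions.reshape(-1)            # (T*D,) flattened expert actions

    w = cp.Variable(K)
    # Minimize the L1 action-prediction error of the linear skill combination;
    # no gradient updates to the VLA backbone are involved.
    objective = cp.Minimize(cp.norm1(B @ w - a))
    # Simplex constraint (nonnegative weights summing to 1) is an assumption;
    # the abstract only specifies a convex problem over linear-combination weights.
    constraints = [w >= 0, cp.sum(w) == 1] if simplex else []
    cp.Problem(objective, constraints).solve()
    return w.value
```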
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 22890