MetaVLA: Unified Meta Co-Training for Efficient Embodied Adaptation

Published: 23 Sept 2025 · Last Modified: 19 Nov 2025 · SpaVLE Poster · License: CC BY 4.0
Keywords: Vision–Language–Action Models, New Embodiment Generalization, Robust Action Reasoning, Efficient VLA training
Abstract: Vision–Language–Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists—they often require task-specific fine-tuning, incur high compute costs, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism—derived from Attentive Neural Processes—to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ∼76%. These results show that scalable, low-resource post-training is achievable, paving the way to general-purpose embodied agents.
Submission Type: Short Research Paper (< 4 Pages)
Submission Number: 57
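
The abstract describes a lightweight meta-learning mechanism derived from Attentive Neural Processes that conditions the policy on diverse auxiliary-task contexts. The paper itself does not include code here, so the following is only a minimal, hypothetical sketch of what such an ANP-style context-aggregation module could look like: a target embedding cross-attends over encoded context examples to produce a conditioning vector. All names (ContextAggregator, dimensions, the usage shapes) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ContextAggregator(nn.Module):
    """Hypothetical ANP-style module: the current target embedding attends
    over embeddings of auxiliary-task context examples and returns a
    conditioning vector that could be fed to a VLA backbone."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Per-example context encoder (assumed MLP)
        self.context_encoder = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Cross-attention from target query to context representations
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, target_emb: torch.Tensor, context_embs: torch.Tensor) -> torch.Tensor:
        # target_emb: (B, dim) embedding of the current observation/instruction
        # context_embs: (B, N, dim) embeddings of N auxiliary-task context examples
        r = self.context_encoder(context_embs)      # encode each context example
        q = target_emb.unsqueeze(1)                 # (B, 1, dim) attention query
        attended, _ = self.cross_attn(q, r, r)      # aggregate context via attention
        return self.out(attended.squeeze(1))        # (B, dim) conditioning vector


# Usage sketch with dummy tensors
agg = ContextAggregator()
cond = agg(torch.randn(2, 256), torch.randn(2, 8, 256))
print(cond.shape)  # torch.Size([2, 256])
```

This sketch only illustrates the general idea of context-aware conditioning with minimal architectural change; how MetaVLA actually integrates the mechanism into its backbone is specified in the full paper, not here.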