Abstract: A long-standing goal in robotics is a generalist policy that can be deployed zero-shot on new robot
embodiments without per-embodiment adaptation. Despite large-scale multi-embodiment pre-training,
existing Vision–Language–Action models (VLAs) remain tightly coupled to their training embodiments
and typically require costly fine-tuning. We introduce Language-Action Pre-training (LAP), a simple
recipe that represents low-level robot actions directly in natural language, aligning action supervision
with the pre-trained vision–language model’s input–output distribution. LAP requires no learned
tokenizer, no costly annotation, and no embodiment-specific architectural design. Based on LAP, we
present LAP-3B, which to the best of our knowledge is the first VLA to achieve substantial zero-shot
transfer to previously unseen robot embodiments without any embodiment-specific fine-tuning. Across
multiple novel robots and manipulation tasks, LAP-3B attains over 50% average zero-shot success,
delivering roughly a 2× improvement over the strongest prior VLAs. We further show that LAP
enables efficient adaptation and favorable scaling, while unifying action prediction and VQA in a
shared language-action format that yields additional gains through co-training.