SynthLA: Synthetic Language-Action Policies for Zero-shot Real-world Manipulation via Structured Perception
Keywords: Language-conditioned manipulation, synthetic data generation, structured perception, vision-language-action, large language models, zero-shot real-world manipulation, multi-task robot policies
TL;DR: SynthLA teaches a robot to turn language plus structured scene descriptions into low-level actions using only synthetic training data, enabling zero-shot real-world tabletop manipulation without human demonstrations
Abstract: Synthetic data offers a scalable alternative to costly real-world data collection for robot learning, yet most language-conditioned manipulation systems still rely on large demonstration datasets or predefined primitive libraries. We present SynthLA, a modular framework that learns low-level language-action policies for robot manipulation entirely from synthetic data. Our key idea is to use structured text as an interface between perception and control. A vision model converts RGB-D observations into a textual scene representation, and a fine-tuned large language model predicts the next low-level robot action in closed loop. Training data are generated automatically by randomizing symbolic scenes and labeling them with a deterministic geometric oracle, producing millions of task-state-action triplets without robot demonstrations. We evaluate SynthLA on four real-world tabletop tasks under both in-distribution and out-of-distribution conditions, as well as static and dynamic scenes. Despite being trained exclusively on synthetic supervision, SynthLA achieves a 73% mean success rate, outperforming a fine-tuned OpenVLA baseline (60%). Our results demonstrate that structured representations can enable effective sim-to-real transfer, highlighting synthetic data as a practical and scalable catalyst for robust robot manipulation.
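The data-generation idea in the abstract (randomize a symbolic scene, then label it with a deterministic geometric oracle) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the SynthLA implementation: the object vocabulary, the workspace bounds, the `STEP` limit, the text format of the state, and the action strings `move ...` and `close_gripper` are all hypothetical.

```python
import random

# Hypothetical object vocabulary and limits; not from the paper.
OBJECTS = ["red block", "blue block", "green cup"]
STEP = 0.02   # assumed maximum translation per action, in metres
TOL = 1e-3    # assumed alignment tolerance, in metres


def random_scene(rng):
    """Sample a symbolic scene: object positions on a table plus a gripper pose."""
    objects = {
        name: [round(rng.uniform(-0.3, 0.3), 3), round(rng.uniform(-0.3, 0.3), 3), 0.0]
        for name in OBJECTS
    }
    gripper = [round(rng.uniform(-0.3, 0.3), 3), round(rng.uniform(-0.3, 0.3), 3), 0.2]
    return {"objects": objects, "gripper": gripper}


def scene_to_text(scene):
    """Render the scene as structured text, standing in for the perception output."""
    gx, gy, gz = scene["gripper"]
    lines = [f"gripper: x={gx:.3f} y={gy:.3f} z={gz:.3f}"]
    for name, (x, y, z) in scene["objects"].items():
        lines.append(f"{name}: x={x:.3f} y={y:.3f} z={z:.3f}")
    return "\n".join(lines)


def oracle_action(scene, target):
    """Deterministic geometric label: step toward the target, or close when aligned."""
    delta = [t - g for t, g in zip(scene["objects"][target], scene["gripper"])]
    if all(abs(d) < TOL for d in delta):
        return "close_gripper"
    # Clip each axis to the per-action step limit.
    step = [max(-STEP, min(STEP, d)) for d in delta]
    return f"move {step[0]:+.3f} {step[1]:+.3f} {step[2]:+.3f}"


def make_triplet(rng):
    """Produce one task-state-action training example."""
    scene = random_scene(rng)
    target = rng.choice(OBJECTS)
    return {
        "task": f"pick up the {target}",
        "state": scene_to_text(scene),
        "action": oracle_action(scene, target),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(make_triplet(rng))
```

Because the oracle is a pure function of the symbolic scene, a fixed seed reproduces the same triplets, and scaling to a large dataset only requires calling `make_triplet` more times. A real pipeline would presumably cover more tasks, orientations, and multi-step rollouts than this toy example does.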
Submission Number: 16