AbsVLA: Learning Robust Primitive Manipulation Skills for VLA Models in Object-Centric Abstracted States
Keywords: Vision-Language-Action (VLA), Robot Manipulation, Object-Centric Representation, Representation Learning, Robustness, Sim-to-Real Transfer
TL;DR: AbsVLA learns manipulation policies in an object-centric abstract state space, improving robustness and generalization under distribution shifts.
Abstract: We investigate the role of representation abstraction in Vision–Language–Action (VLA) policies for robotic manipulation. While recent VLA models show strong performance on multi-task benchmarks, they often exhibit limited robustness under visual and linguistic distribution shifts, especially when trained on few demonstrations.
We present AbsVLA, a framework that integrates vision–language grounding with VLA policies to enable manipulation learning in an object-centric abstract state space. Our approach maps language instructions to primitive skills and constructs object-centric observations that suppress appearance variations while preserving task-relevant spatial structure, improving alignment between demonstration and execution distributions.
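As a rough illustration only (not the authors' implementation), a minimal sketch of constructing an object-centric abstract observation might look like the following, assuming a vision-language grounding model that outputs object labels and poses; all names and data structures below are hypothetical:

```python
# Hypothetical sketch: build an object-centric abstract observation from
# grounded object detections, keeping spatial structure and discarding appearance.
from dataclasses import dataclass

@dataclass
class GroundedObject:
    name: str        # label produced by a vision-language grounding model (assumed)
    position: tuple  # (x, y, z) in the robot base frame
    bbox: tuple      # (x_min, y_min, x_max, y_max) in image coordinates

def abstract_state(objects, instruction_targets):
    """Keep only instruction-relevant objects and their spatial layout;
    appearance cues (texture, color, lighting) are deliberately dropped."""
    relevant = [o for o in objects if o.name in instruction_targets]
    # Represent the abstract state as sorted (name, position) pairs so the
    # encoding is canonical and invariant to detection order.
    return sorted((o.name, o.position) for o in relevant)

# Two scenes that differ only in appearance map to the same abstract state.
scene = [GroundedObject("mug", (0.42, -0.10, 0.05), (100, 80, 160, 150)),
         GroundedObject("plate", (0.30, 0.12, 0.02), (200, 90, 300, 180))]
print(abstract_state(scene, {"mug", "plate"}))
```

Under this kind of abstraction, demonstrations and executions that differ in texture or lighting collapse to the same state, which is one plausible mechanism for the improved demonstration-execution alignment described above.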
Experiments on the LIBERO benchmark show that AbsVLA improves robustness under both visual and language perturbations compared to standard VLA baselines, and enables goal-type transfer from object-specified goals to region-specified targets. We further demonstrate zero-shot sim-to-real transfer to a real robot with a different embodiment.
Submission Number: 26