Keywords: vision-language-action models; robotic manipulation
Abstract: Vision-language-action (VLA) models trained on large-scale robot demonstration datasets achieve impressive in-distribution performance, yet they can fail catastrophically under even minor domain shifts. For instance, a VLA-trained robot instructed to “pick the red block” may fail under environmental disturbances such as lighting changes or scene clutter. To address this limitation, we propose SpinVLA, a novel end-to-end VLA architecture that leverages the mathematical equivalence between spectral decomposition and contrastive learning to improve robustness. Drawing on causal inference principles, which hold that causally relevant features remain stable across environments, we hypothesize that patterns consistent across successful demonstrations capture task-relevant information rather than spurious correlations, i.e., statistical associations unrelated to the true causal factors of task performance. Our approach integrates spectral decomposition to identify demonstration-consistent features, contrastive learning to enforce representational stability, and efficient low-rank adaptation modules for environment-specific tuning. Extensive experiments on the open-source LIBERO benchmark show that SpinVLA significantly improves manipulation task success rates over baseline VLAs under visual perturbations and in the presence of out-of-distribution objects, while maintaining comparable in-distribution performance.
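The equivalence between spectral decomposition and contrastive learning referenced in the abstract is commonly formalized via a spectral contrastive loss (in the style of HaoChen et al., 2021). The following is a minimal, hypothetical sketch of such an objective, not the submission's released code; the function name and the pairing of embeddings from two views of a successful demonstration are illustrative assumptions.

```python
import torch

def spectral_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """Sketch of a spectral contrastive objective.

    z1, z2: (batch, dim) embeddings of two paired views, e.g. two perturbed
    renderings of the same demonstration frame. Minimizing this loss is
    equivalent to a spectral decomposition of the view-pairing (augmentation)
    graph, so paired views are pulled together while unpaired samples are
    pushed apart.
    """
    # Attraction term: paired views should have large inner products.
    pos = -2.0 * (z1 * z2).sum(dim=-1).mean()

    # Repulsion term: squared inner products between non-paired samples,
    # which plays the role of the orthogonality constraint in the
    # spectral-decomposition view of contrastive learning.
    cross = z1 @ z2.T                                   # (batch, batch)
    off_diag = cross - torch.diag(torch.diag(cross))    # zero out paired entries
    n = z1.shape[0]
    neg = (off_diag ** 2).sum() / (n * (n - 1))

    return pos + neg
```

Under this reading, the learned embedding dimensions approximate the top eigenfunctions of the pairing graph, which is one way to operationalize "demonstration-consistent features" as described in the abstract.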
Primary Area: applications to robotics, autonomy, planning
Submission Number: 19812