SpatialVLA-Mamba: Efficient State-Space Models with Self-Refinement for Spatially-Grounded Robotic Control
Keywords: Vision-Language-Action Models, Spatial Reasoning, State-Space Models (Mamba), Reinforcement Learning, Robotics and Embodied AI
TL;DR: SpatialVLA-Mamba unifies spatial grounding, efficient Mamba decoding, and intrinsic CoT-RL to achieve precise, robust, and efficient robotic control in simulation.
Abstract: Recent progress in vision-language-action (VLA) models has enabled robots to follow natural language instructions across diverse manipulation tasks. However, existing approaches struggle with three persistent challenges: limited spatial grounding, which hampers centimeter-level precision; inefficiency and instability in long-horizon execution due to transformer-based decoders; and brittleness under distribution shift, where minor visual or linguistic variations cause failure. We present SpatialVLA-Mamba, a framework that addresses these challenges through three innovations. First, a spatial-aware encoder augments RGB features with depth and geometric primitives, providing explicit metric grounding. Second, a Mamba-based state-space decoder replaces the transformer, offering linear-time complexity and stable long-sequence modeling over extended action horizons. Third, a Chain-of-Thought Reinforcement Learning (CoT-RL) loop introduces intrinsic self-refinement: the policy generates textual outcome summaries of candidate trajectories, scores them with CLIPScore against the goal instruction, and updates itself via PPO, without relying on external language models. Experiments in Webots show that SpatialVLA-Mamba reduces spatial error by over 35% relative to strong baselines, raises unseen-task success to 67.3%, and is more robust to sensor noise and linguistic paraphrasing, all while using less GPU memory and runtime. These results underscore the value of combining spatial grounding, efficient sequence modeling, and intrinsic reasoning for reliable embodied control, pointing toward embodied foundation models that are accurate, efficient, and self-correcting.
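To make the intrinsic self-refinement loop concrete, below is a minimal Python sketch of the reward computation the abstract describes: each candidate trajectory's textual outcome summary is scored against the goal instruction, and the scores serve as PPO rewards. This is a sketch under stated assumptions, not the authors' implementation: the function names (`clip_text_score`, `refinement_rewards`) and the CLIP checkpoint are illustrative, CLIP text-embedding cosine similarity stands in for CLIPScore, and the PPO update itself is elided.

```python
# Hypothetical sketch of the intrinsic CoT-RL reward described in the
# abstract. All names are illustrative placeholders, not the paper's API.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

# Assumed checkpoint; the paper does not specify which CLIP variant is used.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_text_score(summary: str, instruction: str) -> float:
    """Cosine similarity between CLIP text embeddings of a candidate
    trajectory's outcome summary and the goal instruction."""
    feats = clip.get_text_features(
        **tok([summary, instruction], return_tensors="pt", padding=True)
    )
    feats = F.normalize(feats, dim=-1)
    return (feats[0] @ feats[1]).item()

def refinement_rewards(summaries: list[str], instruction: str) -> torch.Tensor:
    """One self-refinement step: score each candidate rollout's textual
    summary against the instruction; the scores become PPO rewards."""
    return torch.tensor([clip_text_score(s, instruction) for s in summaries])

# Example: rewards for two candidate rollouts of a stacking instruction.
rewards = refinement_rewards(
    ["the red cube rests on top of the blue cube",
     "the red cube fell off the table"],
    "stack the red cube on the blue cube",
)
# These rewards would then feed a standard PPO update of the policy (not shown),
# which is how the loop avoids any dependence on an external language model.
```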
Primary Area: applications to robotics, autonomy, planning
Submission Number: 18350