SteerVLA: Steering Vision-Language-Action Models Toward Effective Long-Tail Driving

07 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision-Language-Action Models, Autonomous Driving, Steerability
Abstract: A fundamental challenge in autonomous driving is the integration of high-level, semantic reasoning for long-tail events with low-level, reactive control for robust driving. While large vision-language models (VLMs) trained on web-scale data offer powerful common-sense reasoning, they lack the grounded, embodied experience necessary for safe vehicle control. Conversely, policies trained on driving data exhibit strong reactive skills, but often fail in novel scenarios that require abstract understanding. We posit that an effective autonomous agent must leverage the world knowledge of VLMs to steer a grounded driving policy, rather than attempting to embed all knowledge into a single monolithic model. To this end, we propose SteerVLA, a hierarchical driving policy composed of a high-level VLM planner and a low-level vision–language–action (VLA) policy. The planner produces fine-grained language commands, which steer a flexible, low-level policy for control. To train these policies, we leverage VLMs to augment existing real-world and simulation data with dense annotations in hindsight, which we find is essential for strong reasoning and steerability. We evaluate SteerVLA in challenging real-world open-loop and simulated closed-loop long-tail scenarios, where it outperforms state-of-the-art methods.
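To make the hierarchy described in the abstract concrete, the sketch below shows one way a slow, semantic VLM planner could steer a fast, reactive VLA policy through language commands. All names here (VLMPlanner, VLAPolicy, Control, drive, replan_every) are hypothetical illustrations, not the authors' API; this is a minimal sketch of the interface under the assumption that the planner runs at a lower rate than the control policy.

```python
# Hypothetical sketch of a hierarchical planner/policy interface.
# Nothing below is from the paper; names and signatures are assumptions.
from dataclasses import dataclass

import numpy as np


@dataclass
class Control:
    steer: float  # normalized steering command in [-1, 1]
    accel: float  # normalized acceleration command in [-1, 1]


class VLMPlanner:
    """High-level planner: maps a camera frame to a language command."""

    def plan(self, image: np.ndarray) -> str:
        # e.g. "slow down and merge left around the stalled truck"
        raise NotImplementedError


class VLAPolicy:
    """Low-level policy: maps (frame, command) to a control action."""

    def act(self, image: np.ndarray, command: str) -> Control:
        raise NotImplementedError


def drive(planner: VLMPlanner, policy: VLAPolicy, frames, replan_every: int = 10):
    """Run slow semantic replanning on top of fast reactive control."""
    command = "drive normally"
    for t, image in enumerate(frames):
        if t % replan_every == 0:  # infrequent, high-level reasoning
            command = planner.plan(image)
        yield policy.act(image, command)  # per-frame, grounded control
```

The key design point this sketch illustrates is the division of labor the abstract argues for: the VLM contributes world knowledge only through fine-grained language commands, while the VLA policy retains sole responsibility for grounded, per-frame vehicle control.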
Primary Area: applications to robotics, autonomy, planning
Submission Number: 2762